Many applied psychologists will have occasion
to furnish advice on proper illumination 
intensities for offices, schools, libraries, facto- 
ries, homes and other situations where visual 
work is done. Practical considerations have 
led to the publication of lighting codes and 
pamphlets on standard or recommended prac- 
tice for the guidance of those interested in pro- 
viding adequate illumination for visual work 
in one or another situation. To a large degree, 
the illumination intensities specified in these 
publications are values derived by computation 
from data based upon threshold discrimination 
as in visual acuity studies or in analogous types 
of investigations (6). 

The present study is concerned with an 
evaluation of the Weston-Crouch system of 
computing illumination intensities for a speci- 
fied per cent of maximum visual performance. 

Weston, in two elaborate studies on the 
effect of size of work (8) and the effect of 
brightness contrast (10) and in a discussion (9), 
has presented basic data and formulated a 
method for deriving the illumination intensity 
necessary for 90, 95, 98 or any other desired 
per cent of maximum visual efficiency taking 
into consideration size of detail to be dis- 
criminated and the brightness contrast be- 
tween object to be discriminated and its back- 
ground. Crouch (1) has given the technique 
more definite formulation and has devised a 
nomogram to facilitate the computation. The 
Holophane Calculux (12) is a practical tool 
which may be used instead of the nomogram. 
It includes a size gauge for determining size of 
detail, a reflection factor gauge for determining 
reflection factors of detail and of background, 
and a circular slide-rule arrangement which 
permits settings in terms of size of detail and 
reflection factors to determine illumination 
intensities for 90, 95, and 98 per cent of visual 
efficiency. The readings are in foot candles. 

* Grateful acknowledgment is given to the Graduate 
School, University of Minnesota, for a research grant to 
finance this study. 

Weston employed a performance test with a 
time limit of one minute. The task was to dis- 
criminate and cancel printed Landolt broken 
rings having a given gap orientation. Since 
reaction time was eliminated, this task was 
virtually a visual acuity test. Furthermore, 
it is a performance test. The efficiency or 
performance index employed takes into con- 
sideration both accuracy and speed. Results 
were obtained for discriminating details rang- 
ing from one to six minutes in size with a wide 
range of illumination intensities and of bright- 
ness contrasts. The data obtained provided 
the basis for the formula (and nomogram) 
employed to compute working illuminations 
needed for a specified per cent of visual effi- 
ciency. 

Whenever specifications of illumination for 
a given visual task are arrived at through com- 
putations, the question may be raised whether 
this computed value would be the same as an 
experimentally determined value. Thus, is 
the experimentally determined level of illumination
which produces fastest perception and 
reading of print the same as the computed in- 
tensity derived for this task by Weston and 
Crouch’s technique? The purpose of this 
paper is to present evidence on this question. 
The illumination intensities necessary for fastest
reading of newsprint and of book print will 
be compared with the corresponding intensities 
derived through computation. 

The illumination beyond which there was no 
additional change in reading speed as the in- 
tensity was further increased was determined 
for reading seven point newspaper type set 
solid in a 12 pica line width on newsprint paper 
stock. Illumination ranged from one to 100 
foot candles. Testing was done individually 
in a light laboratory. There were 405 readers 
divided into seven sub-groups of 53 to 70 uni- 
versity students each. The reading material 
consisted of Forms A and B of the Chapman- 
Cook Speed of Reading Test. Accuracy of per- 
formance is held constant at virtually 100 per 
cent. Paterson and Tinker’s results (2, 3) 
show that on the average, university students 
maintain an accuracy of 99.7 per cent. The 
only variable, therefore, is speed of perception 
in reading. Details of experimental design, 
procedures and results are given elsewhere by 
Tinker (5). 

Results 


Trends in the results are shown in Table 1. 
In each sub-group, reading under a specific 
intensity of light was compared with reading 
under 10 foot candles which constituted the 
standard condition for reading Form A. 
Group IV is the control group in which 10 foot 
candles of light were used for reading both 
Forms A and B of the test. The results in 
Group IV reveal that Forms A and B yielded 
equivalent results under the conditions of the 
control group. Any significant differences in 
the experimental groups, therefore, should be 
due to differences in illumination intensities 
compared. 

Levels of significance for differences are 
given in the right hand column of the table. 


Table 1 
Effect of Light Intensities Upon Speed of 
Reading Newsprint 

Note: Differences given are for mean score on Form 
A minus mean score on Form B, Chapman-Cook Speed 
of Reading Test printed in Ionic No. 5 newspaper type, 
seven point set solid, 12 pica line width on newsprint 
paper stock. N = 405 university students. 

Test     Foot Candle           Difference in    Level of
Group    Comparison      N     Paragraphs       Significance
I        10 vs. 1        61       .98                1
II       10 vs. 4        55       .85                1
III      10 vs. 7        70       .07               90
IV       10 vs. 10       55      -.05               90
V        10 vs. 20       53
VI       10 vs. 50       55       .24               50
VII      10 vs. 100      56      -.13               30
Table 2 
Effect of Light Intensities Upon Speed of 
Reading Book Type 

Note: Differences given are for mean score on Form 
A minus mean score on Form B, Chapman-Cook Speed 
of Reading Test printed in Scotch Roman book type, 
10 point set solid, 19 pica line width on eggshell paper 
stock. N = 432 university sophomores. 

Test     Foot Candle            Difference in    Level of
Group    Comparison       N     Paragraphs       Significance
I        10.3 vs. 0.1     72      1.65               1
II       10.3 vs. 0.7     72       .88               1
III      10.3 vs. 3.1     72       .10              36
IV       10.3 vs. 10.3    72      -.15              24
V        10.3 vs. 17.4    72       .25              14
VI       10.3 vs. 53.3    72      -.13              31

Both one and four foot candles of light reduce 
significantly (1 per cent level) the rate of read- 
ing newsprint. From seven to 100 foot 
candles, however, rate of reading remained 
constant. In other words, increasing illumi- 
nation intensity beyond seven foot candles had 
no additional beneficial effect upon visual 
efficiency in reading newsprint. 

The reflection factor of the newspaper stock 
employed was 65 per cent and of the printing 
ink, five per cent. Size of detail in the news- 
print, the opening in the letter e, was 3 minutes 
of arc. Application of the Weston-Crouch 
technique yields the computed value of 8 foot 
candles for 95 per cent of maximum visual per- 
formance, 25 foot candles for 98 per cent, and 
approximately 250 foot candles for 100 per 
cent. 

Another comparison was concerned with 
book print set in 10 point solid on eggshell 
paper stock. In the speed of reading study 
432 university students served as subjects, 
divided into six subgroups of 72 each. The 
illumination used ranged from 0.1 to 53.3 foot 
candles. Again the reading material con- 
sisted of Forms A and B of the Chapman-Cook 
Speed of Reading Test. All testing was done 
individually. Details of experimental design, 
procedures and results are given by Tinker (4). 

Basic data for this study are given in Table 
2. As in the study cited above, reading under 
various intensities of illumination was com- 
pared with reading under a standard level of 
illumination (10.3 foot candles). The results 
for Group IV, the control group, indicate that 
Forms A and B of the reading test yield equiva- 
lent results when experimental conditions are 
constant. 

Under both 0.1 and 0.7 foot candles speed of 
reading was retarded significantly (1 per cent 
level). From 3.1 through 53.3 foot candles, 
however, speed of reading remained constant. 
Increasing the illumination beyond 3.1 foot 
candles, therefore, brought no further increase 
in visual efficiency in reading this 10 point book 
type. 

The reflection factor of the eggshell paper 
stock used was 75 per cent and of the printing 
ink, 5 per cent. Size of detail, the opening in 
the letter e, was 4 minutes of arc. Application 
of the Weston-Crouch technique yields the 
computed value of 3 foot candles for 95 per cent 
of maximum visual performance, 10 foot candles
for 98 per cent, and approximately 100 
foot candles for 100 per cent. 

The discrepancies between the computed 
and the experimentally determined results 
while reading newsprint and book type are 
obvious. The computed foot candle levels are 
much higher than required for maximum effi- 
ciency of performance in the work situation it- 
self. In fact, the illumination intensity which 
the Weston-Crouch computing technique indi- 
cates is adequate for only 95 per cent of maxi- 
mum efficiency, turns out to be adequate for 
100 per cent of maximum efficiency when 
checked experimentally. An analysis of the 
situations which provide the basic data may 
help to explain these discrepancies. 

In the Weston (10) technique speed and ac- 
curacy enter into the performance score. If 
accuracy were 100 per cent, only rate of work 
would enter into this performance score. For 
a given size of detail and a given brightness 
contrast in such a case, the 95, 98, or 100 per 
cent of maximum visual efficiency would cor- 
respond to differences in rate of work. Pre- 
sumably, therefore, any illumination intensity 
significantly less than that computed to pro- 
duce 100 per cent visual efficiency would yield 
a slower performance rate. 

As noted above, the technique employed by 
Tinker yields performance of virtually 100 per 
cent accuracy (99.7). Performance, therefore,
is in terms of rate of work only. According
to the Weston-Crouch technique, maximum 
reading speed of the newsprint should be 
reached only with an illumination intensity of 
250 ± 83 foot candles. (Since higher intensities
cannot be prescribed with great precision, 
the British have adopted a tolerance factor of 
±1/3.) Actually, 100 per cent of visual efficiency
in reading the newsprint is achieved at 7 
foot candles. Similarly, according to the 
Weston-Crouch computations, 100 per cent of 
maximum efficiency in reading 10 point book 
type would be reached with 100 ± 33 foot candles
of light. But experimental check reveals 
that 100 per cent of visual efficiency in reading 
this book type is achieved at 3.1 foot candles. 
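
The size of these discrepancies can be put in numbers. The sketch below is only an illustration: it assumes that the tolerance is one third of the computed value (which matches the margins of 83 and 33 foot candles quoted above), and the function names are hypothetical. It takes the computed and the experimentally determined intensities from the text and reports the tolerance band and the ratio between the two values.

```python
# Illustrative check of computed vs. experimentally determined intensities.
# ASSUMPTION: the tolerance factor is one third of the computed value.
TOLERANCE = 1 / 3

# Foot-candle values quoted in the text for 100 per cent visual efficiency.
CASES = {
    "newsprint, 7 point": {"computed_fc": 250.0, "observed_fc": 7.0},
    "book type, 10 point": {"computed_fc": 100.0, "observed_fc": 3.1},
}


def tolerance_band(computed_fc, tolerance=TOLERANCE):
    """Return (low, high) foot candles allowed by the tolerance factor."""
    margin = computed_fc * tolerance
    return computed_fc - margin, computed_fc + margin


def overshoot_ratio(computed_fc, observed_fc):
    """How many times larger the computed value is than the observed one."""
    return computed_fc / observed_fc


for name, case in CASES.items():
    low, high = tolerance_band(case["computed_fc"])
    ratio = overshoot_ratio(case["computed_fc"], case["observed_fc"])
    inside = low <= case["observed_fc"] <= high
    print(f"{name}: band {low:.0f} to {high:.0f} fc, "
          f"observed {case['observed_fc']} fc, "
          f"computed/observed = {ratio:.1f}, within band: {inside}")
```

Under these assumptions the experimentally determined value falls far below the lower edge of the tolerance band in both cases, so the disagreement cannot be attributed to the tolerance alone.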

It seems obvious that the Weston-Crouch 
computational technique of deriving levels of 
illumination intensity needed for effective 
visual discrimination is not valid in the situ- 
ations described here. A possible explanation 
of these discrepancies may involve the follow- 
ing: (1) The data upon which the Weston- 
Crouch technique is based are essentially meas- 
urements of visual acuity and involve, there- 
fore, threshold discrimination. In an earlier 
paper, Tinker (6) has questioned the validity 
of prescribing illumination intensities for visual 
discrimination in supra-threshold tasks on the 
basis of visual acuity data. The use of visual 
acuity data to prescribe light for many supra- 
threshold tasks is analogous to selecting brick- 
layers on the basis of a finger dexterity test. 
(2) Possibly it is not valid to make a direct 
transfer from visual acuity data to a seeing 
situation where integrated visual patterns are 
involved. In reading print, one is dealing with 
a complex structured situation in which fine 
visual discrimination is neither required nor 
even used most of the time. To a large degree 
the reader reacts to word forms rather than to 
the minute details of individual letters. Only 
occasionally, when the word form is not indi- 
vidually characteristic and when the verbal 
context does not clearly suggest the meaning, 
does the reader exercise discrimination of letter 
details such as when boot must be distinguished 
from boat, or these from there. (3) Possibly the 
formulas developed and used in the Weston- 
Crouch technique are not valid for computing 
illumination intensities needed for visual dis- 
crimination in any situations other than the 
one upon which the formulas are based. 

It seems probable that many other illumination
specifications derived by computational 
techniques are also in error. It is certainly 
true for intensities specified for reading print. 
There is little reason to assume that computed 
intensities specified as necessary for other visual 
tasks will be substantiated by an experimental 
check. Certainly one may justifiably question 
such practices and urge that the experimental 
checks be made. 

As pointed out by Tinker (7), each revision 
of a standard practice or a recommended practice
is a revision upward in foot-candle requirements.
Since these illumination specifications 
are largely derived through computation 
rather than experimentally determined, it is 
legitimate to question their validity. 

The British lighting code (13), which is also 
reproduced in part by Weston (11), is much 
more conservative than various recommended 
practices in the United States (6). Appar- 
ently this is due to the fact that the British 
have taken into account much experimental 
data and also because they have accepted as 
adequate the 90 per cent level of Weston’s 
maximum (11). 

Apparently illumination codes or a listing of 
recommended or standard practices are desir- 
able and necessary. It is obvious that such 
listings cannot be perfect at present. Never- 
theless, all relevant data should be considered 
when a “Recommended Practice of Home 
Lighting” or any other “code” is issued (6). 


Summary 


1. A comparison was made between the 
computed illumination intensity and experi- 
mentally derived intensity for reading news- 
paper print and for reading book print. 

2. The specifications derived through com- 
putations are excessively high in comparison 
with the experimental findings. 

3. The Weston-Crouch method of computing
the illumination intensity necessary for 
effective visual discrimination in reading print 
is not a valid technique. Whether this computational
method is valid for other situations 
remains to be checked by experiment. It is 
possible that the technique is not applicable 
where visual discrimination in other supra-threshold
tasks is involved. 


Miles A. Tinker 


Received February 2, 1951. 


Influence of “Plain Talk” on AMC Communications 


Arthur O. England 
Headquarters, Air Materiel Command, Dayton, Ohio 


In the June, 1950, issue of the Journal of 
Applied Psychology, the writer discussed the 
Air Materiel Command’s “plain talk” campaign.
One year has elapsed since the publication
of our AMC Manual 11-1, entitled “Gobble-de-gook
or Plain Talk.” Over 11,000 
printed copies and 2,000 mimeographed copies 
of this manual have been distributed to super- 
visory personnel in the Air Materiel Command. 

An analysis of the readability of our current 
publications has just been made. This report 
illustrates what change in our publications has 
occurred in a year’s time. The Flesch formula 
was again used. Consistency was maintained 
in the type of printed matter sampled and the 
number of samples taken. 
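
The Flesch formula itself is not restated in this report. For readers who wish to reproduce a Reading Ease score, the sketch below assumes the commonly published form of the formula, 206.835 − 1.015 × (average words per sentence) − 84.6 × (average syllables per word). The function name and the sample counts are illustrative only and are not AMC data.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Reading Ease score from raw counts of a text sample.

    ASSUMPTION: the commonly published Flesch Reading Ease form is used;
    the report does not state which version of the formula was applied.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Hypothetical 100-word sample: 5 sentences, 150 syllables.
score = flesch_reading_ease(total_words=100, total_sentences=5, total_syllables=150)
print(f"Reading Ease: {score:.1f}")
```

Higher scores indicate easier reading, so a rise in the mean score of a publication between two sampling dates is read as a gain in readability.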


Results 


Table 1 shows the improvement made in 
various AMC publications. The per cent gain 
in employee readership was obtained by com- 
paring the estimated per cent of employees with 
the educational level necessary to easily read 
the style of writing in January 1950 and January
1951. It can be seen that a substantial 
gain was made in all AMC publications except 
the Technical Orders and Regulations. Since 
this Command has no jurisdiction over the 
publications that originate from Headquarters, 
United States Air Force or the Civil Service 
Commission, no attempt has been made to 
measure the readability of those publications 
at this time. 

The increase in readability of our publica- 
tions is of considerable concern to management. 
Now, more employees can easily read the ma- 
jority of our publications. Management’s 
message is reaching a larger employee audience. 
More employees waste less productive time in 
reading our publications. On the surface, it 
seems fair to assume that 
more employees can now understand what 
management is trying to say. Substantial 
saving of productive time can be illustrated in 
dollars and cents. 

Table 1 
Comparison of Reading Ease Scores of AMC Publications between January 1950 and January 1951 

                                        Mean        Mean                                Per cent Gain
                                        Reading     Reading     Standard    Standard    in Employees
                            Number      Ease        Ease        Deviation   Deviation   Who Can Easily
                            of          Scores      Scores      1950        1951        Read the
Type of Publication         Samples     Jan. 1950   Jan. 1951   Mean        Mean        Publication
AMC Letters                 50          27.7        36.0        10.3         9.8        28.5
Hq Office Instructions      50          13.6        31.2         8.4        10.7        28.5
AMC Daily Bulletins         60          24.6        36.5         9.4        11.9        63.8 Officers
                                                                                        42.9 Airmen
“Strictly Personnel”        50          24.6        52.4         9.7        12.4        49.5
AMC Employee Newspapers     50          45.2        58.8         9.9        10.8        21
AMC Regulations             50          12.6        19.4         7.3         6.9        No gain
AMC Technical Orders        50          46.3        39.2        11.2        10.3        No gain
“Informant” (formerly
  “Post Script”)            50          49.5        70.2        10.9        11.1        55

“Strictly Personnel,” a four-page publication 
dealing with personnel problems of vital interest
to all civilian employees at Wright-Patterson
Air Force Base, is cited as an example of 
productive time saved with increased readability.
This publication takes approximately 10 
minutes for our employees to read. With 
some 23,000 employees, it represents 230,000 
minutes of reading time, or 3,833.33 hours. 
All reading is done during the work day so, 
taking an “average” hourly wage of $1.60, 
each monthly issue costs approximately 
$6,133.33 in employee productive time. Table 
1 revealed that the increased readability of this 
publication should make it possible for 49.5 per 
cent more employees to be able to easily read 
this publication. That means that $3,035.99 
is no longer wasted with each issue of the pub- 
lication. This illustration of productive time 
saved is limited in scope because “Strictly Per- 
sonnel” is distributed only to approximately 
23,000 employees at Wright-Patterson Air 
Force Base, Dayton, Ohio. Real savings in 
productive time can be demonstrated when at- 
tention is given to those publications distrib- 
uted throughout the entire Air Materiel Com- 
mand affecting more than 107,000 employees. 
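
The arithmetic behind the “Strictly Personnel” illustration can be retraced as follows. This is a minimal sketch that simply follows the reasoning given above (reading cost per issue multiplied by the 49.5 per cent gain); the variable and function names are illustrative.

```python
# Figures quoted in the text for "Strictly Personnel".
EMPLOYEES = 23_000          # readers at Wright-Patterson Air Force Base
MINUTES_PER_ISSUE = 10      # approximate reading time per employee
HOURLY_WAGE = 1.60          # "average" hourly wage in dollars
READERSHIP_GAIN = 0.495     # 49.5 per cent gain from Table 1


def issue_cost(employees, minutes_per_issue, hourly_wage):
    """Return (hours of reading time, dollar cost) for one issue."""
    hours = employees * minutes_per_issue / 60
    return hours, hours * hourly_wage


hours, cost = issue_cost(EMPLOYEES, MINUTES_PER_ISSUE, HOURLY_WAGE)
saving = cost * READERSHIP_GAIN

print(f"Reading time per issue: {hours:,.2f} hours")
print(f"Cost per issue:         ${cost:,.2f}")
# The text gives $3,035.99; the one-cent difference comes from rounding.
print(f"Time no longer wasted:  ${saving:,.2f}")
```

The same function can be applied to a Command-wide publication by substituting the larger number of employees, although the reading time and readership gain for such a publication would have to be measured separately.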


Future Programs 
A film strip entitled, “The Key to Plain 
Talk,” has been prepared for distribution 
throughout the Air Materiel Command. This 
film strip discusses the need for “plain talk,” 
the Flesch formula, and examples of how to 
write clearly and simply. It has a running 
time of 15 minutes. It was released for the 
purpose of re-emphasizing the need of “plain 
talk” and for indoctrination of all new military 
officers to the Air Materiel Command. At the 
present time, a limited distribution has been 
made of a “plain talk” dictionary. This dic- 
tionary represents an attempt to give “plain 
talk” meanings to “Gobble-de-gook” words 
most frequently found in government writing. 
Also, time-worn clichés have been re-written 
in simple, straight-forward language. When 
all additions and corrections to the present dic- 
tionary have been consolidated after this pres- 
ent trial period, the final booklet will be pub- 
lished and distributed throughout the Air 
Materiel Command. 

Progress has been made to date in getting 
more key civilians and military personnel to 
write their messages in a style that is more 
easily read by our employees. With the use 
of the new film strip and subsequent publica- 
tion of the “plain talk” dictionary, it is hoped 
to achieve even greater success in simplification 
of our employee communications. 


Since March 1944 steps have been taken to 
incorporate psychological tools, concepts, prin- 
ciples and attitudes into the Medical Division 
of the Caterpillar Tractor Co. as a fulcrum to 
extend throughout the plant embracing all 
aspects of inter-personal relationships. Not 
until January 1946 was a program put into 
active operation. This comprehensive mental 
hygiene program has been described in previous
publications (2, 3, 5, 7). This report will 
attempt to evaluate several of the facets such 
as selection, placement and counseling. 


Method 


The pre-employment psychological test battery
given to production workers consisted of 
four tests: the Wonderlic Personnel Test, the 
Bennett Test of Mechanical Comprehension, 
the Cornell Word Form and the Cornell Index 
(1, 4, 6, 8). The first two concerned ability 
and aptitude and the latter two, emotional 
stability or personality. Foremen were asked 
to give the names of employees hired during 
1946 who were doing excellent work and the 
names of those who were doing extremely poor 
work. A series of 250 cases was selected since 
they fulfilled the following criteria: (1) tested 
in 1946; (2) assigned to a starting job; (3) 
could qualify as an “extreme” man—either 
very good or very poor. Of these, 150 employees
were earmarked as “good,” and 100 as 
“poor.” Reference to the employees’ test 
results determined some months previously 
was made to see whether these results could 
have predicted the employees’ subsequent job 
performance as rated by the foremen. The independent
variable was determined from a 
psychograph which showed the relationship of 
the four tests of the battery to one another. 
Table 1 illustrates the psychograph that was 
developed and indicates how the “profile” 
could be determined which shows the relationship
of each of the tests in a pattern. The 
first letter refers to the qualitative grade which 
reflects the centile ascribed for the intelligence 
test, the second for the mechanical aptitude 
test, the next two for the personality measures 
respectively. The “profile” values were 
grouped according to significance such as 1, 2, 
3, 4, in one group against 5, 6, 7 values.¹ 

* Formerly Personnel Consultant, Medical Division, 
Caterpillar Tractor Co., Peoria, Illinois. Now, Associate
Professor of Medical Psychology, Dept. of Psychiatry
and Mental Hygiene, University of Louisville 
School of Medicine, Louisville, Ky. This paper concerns
a project devoted to the development of a mental 
hygiene program for industrial organizations. Other 
participants in this project were: Drs. Vonachen and 
Kronenberg of Caterpillar Medical Division and Drs. 
Mittelmann, Brodman, and Wolff of the Cornell 
University Medical College, New York City, N. Y. 


Results 


In 72 per cent of the cases, the tests agreed 
with the foremen’s subsequent appraisals; however,
there was 65 per cent agreement between 
test “profiles” and subsequent poor job performance,
while there was 77 per cent agreement
between test “profiles” and subsequent 
good performance. The tetrachoric correlation 
was +.63. Hence, there is a good relationship 
between on-the-job performance and the “profile”
value which indicates scores on the test 
battery. Another method of analysis was to 
combine the Cornell Index and the Cornell 
Word Form scores and to compare this new 
pattern against the foremen’s ratings to see 
what relationship existed between personality 
measures and subsequent job performance. It 
will be seen from Table 2 that a critical ratio 
of 3.46 was found indicating a significantly 
reliable difference between employees rated 
“good” and employees rated “poor” in their 
subsequent job performance. A similar analysis
was undertaken between the combined 
scores of the Personnel Test and the Mechanical
Comprehension Test for the same employees.
Table 3 shows a resultant critical ratio of 
6.93 indicating a significantly reliable difference
between employees rated “good” and employees
rated “poor” on job performance in 
comparison with their intelligence and their 
mechanical aptitude scores. 

¹ After the “profile” had been obtained for each subject,
values of 1, 2, 3, 4 or 5, 6, 7 were assigned on the 
basis of pre-determined groupings; thus, for example, 
profile EEEE received a value of 1 and FFFF a value 
of 7, with various intermediary permutations. 


Table 1 
The Psychograph of Normative Data 

                                 Percentile Norms                         Percentile Grouping
                                                                  0-20   21-40   41-60   61-80   81-100
                   0   10  20  30  40  50  60  70  80  90  100    Fail   Poor    Aver    Good    Excell
Personnel (F)      [norms illegible]                               F      P       A       G       E
Mechanical (AA)    0   16  22  26  30  34  36  39  43  47  60      F      P       A       G       E
C. W. F. (2)       [norms illegible]                               F      P       A       G       E
Cornell Index      [norms illegible]                               F      P       A       G       E
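
The critical ratio of 6.93 reported for the combined Personnel and Mechanical Comprehension scores can be checked from the group sizes, means, and standard deviations given in Table 3. The report does not state the formula used; the sketch below assumes the conventional large-sample form, the difference between means divided by the standard error of that difference, and the function name is illustrative.

```python
import math


def critical_ratio(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference between two means divided by its standard error.

    ASSUMPTION: standard error = sqrt(sd_a^2 / n_a + sd_b^2 / n_b),
    the usual large-sample form; the report does not give its formula.
    """
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Table 3: combined Personnel and Mechanical Comprehension scores.
cr = critical_ratio(mean_a=58.7, sd_a=13.3, n_a=150,   # rated "good"
                    mean_b=46.3, sd_b=14.2, n_b=100)   # rated "poor"
print(f"Critical ratio: {cr:.2f}")
```

Under this assumption the result is about 6.94, which agrees with the reported 6.93 to within rounding of the tabled means and standard deviations.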


Evaluation of Counseling 


Counseling the maladjusted employee was a 
major aspect of this mental hygiene program. 
Almost all newly hired employees were inter- 
viewed briefly by the writer who had before 
him the subject’s test scores in the form of the 
“profile.” This “profile” helped to shed additional
light upon the psychological status of 
the employee so that one of several dispositions 
could be effected. In the first place, the test 
results were considered in relation to the job 
placement to see if this choice was suitable 
(since the employment supervisor had a copy 
of the “profile” and since he was trained in the 


Table 2 


Data for the Critical Ratio Applied to the Combined 
Scores of the Cornell Index and the Cornell Word 
Form for Employees Rated “Good” and for 
Employees Rated “Poor” by Their Foremen 


interpretation of this tool, he made very few 
misplacements). Secondly, the test results 
enabled the writer to suggest to the employee 
that he return for a brief counseling session if 
necessary, since his emotional stability scores 
may not have been adequate. In many situ- 
ations, job re-classifications occurred even be- 
fore the employee was finally placed since it 
was felt that he lacked the necessary intelli- 
gence, emotional stability or both for the job 
originally assigned. Thirdly, there were in- 
stances in which the new employee was im- 
properly placed since he had superior ability, 
and was being placed on a routine laboring job 
and it was assumed that he would be more 
efficient and would derive more satisfaction if 
placed commensurate with his abilities; thus 
several intelligent employees were introduced 
to the training department with a view toward 
apprentice training and higher rated job classi- 
fications. 

Counseling of a more specific nature, for 
*herapeutic purposes rather than for preventive 
purposes as pointed out above, was undertaken 


Table 3 


Data for the Critical Ratio Applied to the Combined 
Scores of the Personnel and the Mechanical 
Comprehension Tests for Employees Rated 
“Good” and for Employees Rated “Poor” 

by Their Foremen 


“Good” “Poor” “Good” “Poor” 
Employees Employees Employees Employees 
Number of Cases 150 100 Number.of Cases 150 100 
Mean Scores 5.4 Fe Mean Scores 58.7 46.3 
Standard Deviation 4.3 §.1 Standard Deviation 13.3 14.2 
Critical Ratio 3.46 Critical Ratio 6.93 
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with much enthusiasm as part of this program. 
In this connection, the employee referred him- 
self (herself); was referred by his foreman; was 
referred by a friend who had been counseled; 
or was referred by other professional personnel 
of the Medical Division. 

For a group of 41 employees whose counsel- 
ing was considered successful, in terms of their 
statements and those of their foremen and in 
terms of their improvement on the job, many 
were still working effectively 10.8 months or 
almost 11 months after initial steps were taken 
to counsel them. Similarly, a group of 17, 
who were unsuccessful in achieving the desired 
rehabilitation from counseling according to 
their foremen’s statements, the persistence of 
symptoms and continued inefficiency on the 
job, averaged only 3.3 months and then quit. 
For an experimental control, a randomized 
group of 29 employees, who at the time of their 
employment were known to have emotional 
problems of seemingly the same intensity and 
severity as the 58 counseled employees, were 
nevertheless not counseled and were assigned 
to their regular jobs in the usual fashion. As 
a group, these employees averaged only 2.4 
months with the company and then separated 
or were discharged. In other words, where 
counseling was considered successful, the em- 
ployee remained longer with the company, thus 
reducing turnover and contributing to the 
employee’s satisfaction, efficiency, and adjust- 
ment. 


Summary 


1. This paper is concerned with a follow-up 
study of some aspects of the comprehensive 
industrial mental hygiene program at the 
Caterpillar Tractor Co. This report consists 
of the validation of the testing program and 
evaluation of employee counseling. 

2. A tetrachoric correlation of +.63 was 
found between “profile” values which reflect 
scores of a battery of four tests, namely, the 
Wonderlic Personnel Test, the Bennett Me- 
chanical Comprehension Test, the Cornell 
Index and the Cornell Word Form, against 
foremen ratings of job performance for a group 
of 150 employees rated “good” and 100 em- 
ployees rated “poor.” 

3. A critical ratio of 3.46 was determined 
between the combined scores of the Cornell 
Index and the Cornell Word Form for em- 
ployees rated “good” and employees rated 
“poor” by their foremen. For the same popu- 
lation a critical ratio of 6.93 was found for the 
combined scores of the Wonderlic Personnel 
Test and the Bennett Mechanical Compre- 
hension Test. 

4. For a group of 41 employees who were 
considered successfully counseled an average 
duration of employment of approximately 11 
months was achieved. For a group of 17 
employees considered unsuccessfully counseled, 
the average duration of employment was only 
3.3 months. A group of 29 randomly chosen 
controls, that is, employees with emotional
problems of seemingly similar intensity 
and proportions to those of the counseled group 
but who were not counseled, stayed 
with the company 2.4 months, which was the 
average for so-called “normal” employees who 
did not present any emotional problems requiring
help at time of employment. 


Rating Conference Participation in a Human Relations Training Program * 


Francis J. Di Vesta, James H. L. Roach, and William Beasley 


The present study was conducted to develop 
a scale for rating conference procedure accept- 
able to a specific population. Concurrently, a 
systematic attempt was made to examine cer- 
tain procedures for minimizing those weak- 
nesses of rating scales occurring where inex- 
perienced raters are used. Although the study 
was conducted in a military program, it is ex- 
pected that the procedures and results may be 
applied or extended to other governmental or 
industrial agencies concerned with training in 
human relations by conference techniques. 
However, because the authors were dealing 
with a definitely high-level group, it is doubt- 
ful that the scales described here can be ac- 
cepted as finished instruments for general use 
with any group in which the conference method 
is used. 

This study was conducted in the Military 
Management Division of the Special Staff 
School, Air University. The purpose of this 
division is to train senior Air Force officers in 
the application of human relations and man- 
agement principles to the military situation. 

Since the course is developed around case 
problems, much of the classwork is carried out 
in conference or seminar situations. The How 
Supervise? test (5) is used as a pre- and post- 
test for determining emphases to be made in 
course content and for evaluating student 
achievement of factual material. It is in the 
seminar that the individual demonstrates his 
application of facts and management principles 
while working in a group. The performance in
seminar situations affords the opportunity for 
observations leading to effective guidance and 
evaluation of student officers. To provide 
further training in the seminar situation the 
students rate one another. Making such rat- 
ings should help students to be more critical of 
their own personal attack on problems and to 

* Personal views or opinions expressed or implied in
this publication are not to be construed as carrying
official sanction of the Department of the Air Force or
the Air University.
gain insight into seminar technique. The
training for the use of the rating scale is limited 
to a two-hour demonstration and practice pe- 
riod at the beginning of the course. Then dur- 
ing the course each student rates approximately 
twenty students and is rated himself approxi- 
mately twenty times. The analyses of the 
ratings are ultimately used by the instructor 
during the course for student counseling. A 
by-product of the use of the scale might be an 
estimate, in numerical terms, of the students’ 
performance. 


Assumptions 


Experience in the Military Management Di- 
vision indicated a need for an evaluation instru- 
ment which could meet six requirements in 
particular. First, the instrument should meas- 
ure the extent to which factors important to 
conference procedures in management were or 
were not demonstrated. Prior attempts at 
evaluation were based on the factors attitude, 
logical thinking, and participation. These 
factors were the basis for the selection of items 
in the present instrument. Second, the instru- 
ment should be usable with a minimum amount 
of training in its use. This requirement was 
imposed because students who were to use 
these scales were relatively untrained in the 
making of objective ratings. In addition, the 
short duration of the course (five weeks) limited 
class time for anything other than course con- 
tent. Third, the instrument should be such 
that the student doing the rating would do so 
objectively. Fourth, the instrument should be 
developed so that it could be scored objectively 
if a grade was desired. Fifth, the instrument 
should provide an analysis of performance 
which could be used by the instructor for stu- 
dent guidance and which would form a basis 
and guide for the student in improving his per- 
formance. Sixth, the instrument should re- 
flect student progress during the course. 

Rating scales previously used were not satis- 
factory because they failed to meet one or more 
of these requirements. Common difficulties
found were: failure of the scales to discriminate, 
difficulty in the interpretation of words or sen- 
tences, tendency to use the upper end of the 
rating scale only (7), difficulty of the rater in 
identifying traits to be measured, and low reli- 
ability. 

Certain concepts were used as a basis for the 
construction of a rating device suitable for use 
in this situation. First, items to be used were 
to represent actual, observable behavior in the 
seminar. Second, these incidents were to be 
characteristic of either good or poor performance,
but not both (5, 6, 8, 10). Third, these
items were to be placed in a check list in ran- 
dom order to lessen the possibility of the rater 
determining whether the item was descriptive 
of a relatively strong or weak characteristic. 
Students then checked only those items indica- 
tive of the behavior they actually observed 
(9). Fourth, although positive, negative, and 
neutral items were to be included in the check 
list, the positive items would be mainly used in 
the scoring. This was done on the assumption 
that the individual who demonstrated more 
positive behavior incidents described by the 
items was superior to the one who demonstrated 
fewer positive behavior incidents. This was 
especially important when students were to 
rate one another. Experience showed that 
students tended to hesitate to say anything 
negative about one another except in extreme 
cases (3). Fifth, a “buffer” would be neces- 
sary to help the individual be more objective 
than might normally be the case. 

This “buffer” was eventually accomplished 
in two ways: first, positive items which had 
nothing to do with seminar behavior were 
placed in the check list, thus allowing the rater 
to say something flattering about the student 
being rated without influencing the score; 
second, a space for written comment helped to 
serve the same purpose although these com- 
ments actually proved to be useful for guidance 
functions as well. 


Selection of Items 


The first step in the selection of items was to 
select statements from the brief word picture 
in the academic records of individuals who had 
completed the course. In addition, with the
assistance of instructors, several items were
constructed which were believed to reflect the 
objectives of the course and to be easily ob- 
served in seminar performance. 

These two procedures resulted in a list of 225 
traits believed to be related to conference per- 
formance in military management. The lists 
were then submitted to each faculty member 
with instructions to comment on each item, to 
revise the item, or to submit additional items. 
On this basis a second list of 212 items was made 
and mimeographed. This list was then sub- 
mitted to the faculty to evaluate each item in 
the list on a scale of one to five in terms of how 
well it applied to a particular individual in the 
class whose performance they felt qualified to 
assess. After this was done, that individual 
was rated by the instructor on a scale running 
from one to ten, depending upon whether the 
student was considered one of the best or one 
of the poorest in his class. With another stu- 
dent in mind opposite in quality to the first one 
used, each faculty member was asked to rate 
each item in the list in terms of how well it 
applied to that individual. The faculty then 
rated each item in this tentative list a plus two 
(++) if the item indicated a very favorable 
student characteristic, plus one (+) if somewhat
favorable, zero (0) if neutral, minus one (—) 
if somewhat unfavorable, and minus two (— —) 
if highly unfavorable. Ratings given the items 
were weighted numerically and totaled by the 
evaluation specialists to obtain an index of the 
faculty’s reaction. Items were then selected 
on two bases: first, whether the item was used
more discriminately by the faculty for students
rated as poor or as good; second, whether the
item was positive, negative, or neutral. Fifty- 
two items were selected on this basis (10) to 
compose the first trial check list. 

The check list was then given to the students 
for use in the actual conference situation. The 
student raters were given instructions to check 
only those items which they felt characterized 
the student on the day that he was being rated. 
At the middle and at the end of the course the 
total number of checks given each item for any 
one individual were tabulated. These were 
used by the faculty to identify individual stu- 
dent weaknesses, to show progress made by the 
student, and to improve the conduct of the con- 
ference. Students were shown their relative 
standing by comparing the total number of
checks they were given on positive and nega- 
tive items to the average per cent of responses 
given to the entire class on these items. By 
making these tabulations the instructors had 
concrete evidence to guide them in performing 
the evaluation and counseling functions. 

Another tabulation was made of the total 
number of checks given to any one item during 
the term. On the basis of this tabulation for 
Class 50-A,¹ it was found that 11 items were
used too infrequently and so were discarded. 
Seven items were selected from the original 
list of 225 items to replace, in part, those dis- 
carded. In addition, an item for rating the 
degree of participation was added at the end of 
the check list. This revised list was used in
Classes 50-B, 50-C, and 50-D. 


Faculty Rankings as a Criterion 


At the end of each class in the course each 
instructor was asked to rank in order the top 
ten and bottom ten individuals in the class. 
The ranking was made on the basis of how well 
students performed in the military manage- 
ment seminar and was entirely independent of 
student ratings. Numerical values from plus 
ten to plus one were assigned these rankings for 
the best to poorest individuals in the top ten 
and from minus one to minus ten for the best 
to poorest individuals in the bottom ten. A 
zero score was given to those individuals not 
listed by the instructors in either of these two 
groups. The six instructor ranks given each 
individual were totaled to yield a composite 
score. On this basis, the highest possible score 
was plus sixty and the lowest possible score was 
minus sixty. These scores were used as the criterion
for performing item analyses and validity
checks. The range of scores, means, and
standard deviations are shown in Table 1. 
These figures provide an indication that the 
instructor rankings were sufficiently discrimin- 
ative to be useful as a criterion. 
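
The composite criterion score described above can be sketched in a few lines. This is only an illustration of the stated rule (plus ten to plus one for the top ten, minus one to minus ten for the bottom ten, zero otherwise, summed over six instructors); the function names and the input layout are assumptions made for the example, not part of the original study.

```python
def rank_value(group, position):
    """Convert one instructor's listing of a student into a numerical value.

    group    -- "top", "bottom", or None if the student was not listed
    position -- 1 (best) to 10 (poorest) within the listed group
    """
    if group == "top":
        return 11 - position      # best of top ten -> +10, poorest -> +1
    if group == "bottom":
        return -position          # best of bottom ten -> -1, poorest -> -10
    return 0                      # not listed in either group


def composite_score(listings):
    """Sum the values from all instructors (six in the study) for one student."""
    return sum(rank_value(group, position) for group, position in listings)


# Hypothetical student: listed first by three instructors, fourth in the
# top ten by one, unlisted by one, and best of the bottom ten by one.
example = [("top", 1), ("top", 1), ("top", 1), ("top", 4), (None, 0), ("bottom", 1)]
print(composite_score(example))
```

With six instructors the score is bounded by plus sixty and minus sixty, as the text states.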


Item Characteristics 


Table 2 shows all of the items used on the
rating scale arranged in the order of their
relationship to the instructor rankings of students.²
A rating of each item according to instructor
judgment (see page 5), tetrachoric correlations
of each item with instructor ranking of students,
and the frequency with which each item was
used are also shown in the table.


¹ Classes are labeled by year and letter. Classes
50-A, 50-B, and 50-C were used as a basis for this study.
Class 50-D was used as a check on results found with
the other classes.


Table 1


Means and Variability of Instructor Rankings on
Students in Three Military Management Classes


Class     N    Range of scores    Mean    S.E.
50-A     36    +47 to -47           0     3.66
50-B     31    +56 to -57           0     4.34
50-C     30    +48 to -53           0     4.34
Total    97    +56 to -57           0     2.34

The tetrachoric correlation was used as a 
basis for determining the discriminatory value 
of the item. The first dichotomous variable 
was whether the item was or was not checked. 
Individuals with instructor rank scores above 
and below the average were used for the second 
dichotomous variable. Each of the tetrachoric 
correlations has been calculated from approxi- 
mately twenty ratings on each individual in the 
class, or a total of approximately 1970 ratings. 
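
The paper does not say how the tetrachoric coefficients were computed. As a rough illustration of the fourfold table involved, the sketch below uses the well-known cosine-pi approximation to the tetrachoric correlation; the choice of approximation, the function name, and the cell counts are assumptions made for the example.

```python
import math


def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    Fourfold table of ratings (assumed layout):
        a -- item checked,     ratee above average on instructor rank score
        b -- item checked,     ratee below average
        c -- item not checked, ratee above average
        d -- item not checked, ratee below average
    """
    if a * d == 0 and b * c == 0:
        raise ValueError("table is degenerate; coefficient undefined")
    if b * c == 0:
        return 1.0                         # limiting value of the approximation
    ratio = (a * d) / (b * c)              # cross-product ratio
    return math.cos(math.pi / (1.0 + math.sqrt(ratio)))


# Hypothetical counts for one item over all ratings in a class.
print(round(tetrachoric_cos_pi(40, 10, 10, 40), 3))
```

A positive value means the item was checked more often for students ranked above average, and a negative value the reverse. Items checked very rarely give small cell counts and unstable coefficients, which is the reliability concern raised in the text.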

Tetrachoric correlations serve the following 
purposes for analysis: As a discrimination index 
they point to items which, when checked, dif- 
ferentiate between good and poor individuals. 
The sign of the correlation indicates whether 
the item is associated with good or poor per- 
formance in the seminar situation. When the 
item is used frequently enough to avoid unreli- 
able correlations, it indicates which factors con- 
tribute to and are considered important in good 
seminar performance, which are neutral, and 
which factors detract from good performance. 
Thus, offering suggestions, inspiring confidence, 
setting a good example (in conference), stimu- 
lating thinking of participants, original think- 
ing, clarity of expression, and organizational 
ability are factors related to good performance 
as judged against instructors’ ranking of students.
On the other hand, inability to “sell
point of view,” inability to work as a member
of the team, slowness in accomplishing the goal,
and being a follower rather than a leader are
highly related to poor performance. These
correlations also tend to give face validity to
the techniques used here.


² To reduce printing costs the complete Table 2 has
been deposited with the American Documentation
Institute. Order Document 3254 from American
Documentation Institute, 1719 N Street, N.W.,
Washington 6, D.C., remitting $1.00 for microfilm (images
1 inch high on standard 35-mm. motion-picture film)
or $1.00 for photocopies (6 × 8 inches) readable
without optical aid.

A comparison of correlations from class to 
class on any single item will indicate whether 
the discriminative value is relatively consistent. 
For example, the correlations between re- 
sponses on the item, “needs to develop more 
ability as a member of a team,” and instructors’ 
ranks for each of the three classes were found 
to be consistently negative. On the other 
hand, correlations between responses to the 
item, “argumentative when his point is ques- 
tioned,” and instructors’ ranks for each of three 
classes were found to vary between low nega- 
tive to high positive in value. Variations from 
class to class may be interpreted to mean that 
the statement was either ambiguous, permitting 
different interpretations, or that different 
groups varied in their willingness to use the 
item. The latter factor influences the number 
of times the item was used, which in turn in- 
fluences the reliability of the results. 

The percentage of checks for each item shows 
the frequency with which the item was used. 
Positive items were used more frequently than 
negative items. That is, students checked 
more items associated with good performance 
than with poor performance. Since students 
show a reluctance to check negative items, it 
may follow that the use of only positive items as 
a basis for a score is a sound practice when objective
scores are desired in this situation.
Thus, an individual who is given more checks 
on more positive characteristics may be said 
to perform more favorably than individuals 
with fewer ratings on positive items. Nega- 
tive and neutral items may still be included in 
the check list. When negative items are 
checked, they provide a good basis for guid- 
ance of individual students and point out con- 
ference procedure requiring attention from the 
instructor. 


Characteristics of the Scale 


For purposes of this study, those items with 
consistent positive or negative correlations 
from class to class were used as a basis for 
deriving a score from the scales. Twenty-one
positive items³ and five negative items⁴ were
selected. Items with positive correlation co- 
efficients were weighted a plus one and those 
with negative coefficients, a minus one. This 
was the only basis for weighting the items. 
Only checked items were used in computing the 
grade. 
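
The unit-weight scoring rule just described can be sketched as follows. The item numbers are those given in the footnotes listing the positive and negative items from Table 2; the function name is hypothetical, and how the scores from the roughly twenty raters were averaged for each student is not specified in the text, so the sketch scores a single completed check list only.

```python
# Item numbers as listed in the footnotes to this section (Table 2 numbering).
POSITIVE_ITEMS = {1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14,
                  16, 17, 18, 19, 21, 23, 24, 26}
NEGATIVE_ITEMS = {36, 38, 39, 40, 41}


def scale_score(checked_items):
    """Score one check list: +1 per checked positive item, -1 per checked
    negative item. Unchecked items and all other items contribute nothing."""
    checked = set(checked_items)
    return len(checked & POSITIVE_ITEMS) - len(checked & NEGATIVE_ITEMS)


# Hypothetical rating: three scored positive items, one scored negative item,
# and one item (10) that is not among the twenty-six scored items.
print(scale_score([1, 4, 10, 17, 38]))
```

Under this rule a single check list can score anywhere from minus five to plus twenty-one.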

The means and standard error of the means 
for the first and second half (in terms of time) 
of each of the three classes are shown in Table 
3. Critical ratios between the scores for each 
half are also shown. For two of the classes 
(50-A and 50-C) the scale reflected an improve- 
ment in students’ performance. For Class 
50-B the scale scores were lower for the second 
half than for the first half of the course. The 
Pearson correlation between paired scores (N 
= 95) from the first half and second half was 
.62. This correlation, when compared to reli- 
ability standards for tests, is relatively low but 
considerably higher than correlations found 
with other rating devices (.15 to .37). 

The Pearson correlation between the total 
average scale score and instructor rank scores 
was .55. Correlations between the total aver- 
age scores derived by weighting all items ac- 
cording to instructor preference indexes and the 
instructor rank scores were .77 for Class 50-A, 
.44 for Class 50-B, .35 for Class 50-C, and .37 
for all classes combined. These relationships 
seem to indicate that, if the instructor rankings 
are valid criteria, the selection of items for this 
situation on a statistical basis using class data 
provides a more consistent measure than items 
selected on the basis of judgments alone. 

Selected items were next weighted plus two 
to minus two on the basis of the magnitude, as 
well as the sign, of the tetrachoric correlation. 
The correlation between this score and the in- 
structor ranking was .53. This is not signifi- 
cantly different than when items were weighted 
plus one to minus one. 

The degree of participation of each participant
in Class 50-B and Class 50-C was determined
by requiring the rater to check the one
of five categories best describing the participant
being rated. These categories were (1)
seldom or never participates, (2) occasionally
participates, (3) regularly participates, (4)
participates quite often, and (5) participates
at every opportunity. Pearson correlations of
varying magnitude were obtained between the
degree of participation and the following: scale
score based on all items weighted +2, +1, -1,
-2 according to instructors’ judgments, 0.80
(N = 59); scale score based on twenty-six items
selected and weighted +1 to -1 according to
tetrachoric correlations, 0.74 (N = 59);
instructor rankings of students, 0.42 (N = 59).


³ From Table 2 the positive items were Numbers 1,
2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18, 19, 21, 23,
24, and 26.

⁴ From Table 2 the negative items were Numbers
36, 38, 39, 40, and 41.


Table 3


A Comparison of Composite Rating Scale Scores for
Each Half of the Course

[Table body not recoverable from this copy. Surviving
column headings: Second Half, Mean, S.E., Critical.
Surviving entries: 5.99, 6.82, 6.81, 6.50; 9.35, 5.59;
All Classes, 95, 6.48.]

It is reasonable to expect a high relationship 
between participation and the scores from the 
scale since the more the individual participates, 
the more opportunity will he have to demonstrate
characteristics on the check list. On the
other hand, the correlations are sufficiently low 
to enable one to make the assumption that 
something more than quantity, probably 
quality, of performance enters into the ratings. 

The scale score, based on twenty-six items, 
was compared with scores on the How Super- 
vise? test given to fourteen volunteer students 
from Class 50-C at the end of the course. In 
this comparison the rank-difference correlation 
was —.08. Because of the small number of 
students and because the students were volun- 
teer subjects, the validity of the results is open 
to question. However, the results suggest the 
possibility that the test and the rating scale, 
because of the apparent independence of one 
another, can be used to supplement one another 
for evaluation purposes. 
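
For readers unfamiliar with the rank-difference correlation used here, the following sketch applies the usual formula, rho = 1 - 6 Σd² / (n(n² - 1)), which assumes no tied ranks. The ranks shown are invented for illustration; the study's own scores for the fourteen volunteers are not reported.

```python
def rank_difference_correlation(ranks_x, ranks_y):
    """Spearman rank-difference correlation for two lists of untied ranks."""
    if len(ranks_x) != len(ranks_y):
        raise ValueError("both lists must rank the same individuals")
    n = len(ranks_x)
    if n < 2:
        raise ValueError("at least two individuals are required")
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1.0 - (6.0 * sum_d_squared) / (n * (n ** 2 - 1))


# Hypothetical ranks of five students on the check-list score and on a test.
scale_ranks = [1, 2, 3, 4, 5]
test_ranks = [3, 5, 1, 4, 2]
print(round(rank_difference_correlation(scale_ranks, test_ranks), 2))
```

A value near zero, such as the -.08 reported above, indicates that standing on one measure says little about standing on the other.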


Follow-up 


Following the study reported thus far, data 
became available on a new class (Class 50-D). 


This presented an opportunity for checking the 
results on the previous classes. The number 
attending this class was considerably smaller 
(N = 18) than usual. 

With the data from this class Pearson corre- 
lations, similar to those from previous classes, 
were found between the check-list scores based 
on previously selected twenty-six items and 
the following: final How Supervise? test scores, 
0.01 ± .15; instructors’ rankings of students,
0.59 ± .10. The correlation between check-list
scores from the first half of the course and
those from the second half of the course was
0.48 ± .13. The correlation between the
instructors’ rankings of students and the final
How Supervise? test was 0.21 ± .14.

When these correlations are viewed in terms 
of the correlations found in the previous classes, 
it will be noted that the results are similar. 
The relationship between scores from both 
halves of the course has dropped, while the correlation
between the check-list score and instructor
ranking has increased. However,
because of the small number of cases involved, 
the probable errors are large and the figures 
should be considered only as estimates. The 
only correlation significant at the .01 level is 
the correlation between check-list scores and 
the instructor ranking of students. 
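
The probable errors attached to the correlations above can be approximated with the classical formula P.E. = 0.6745 (1 - r²) / √N. The sketch below is an assumption about the kind of computation involved, not a statement of the authors' procedure: with r = .59 and N = 18 it gives about .10, as reported, but for the other three coefficients it differs from the reported values by about .01.

```python
import math


def probable_error(r, n):
    """Classical probable error of a Pearson correlation coefficient."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 0.6745 * (1.0 - r ** 2) / math.sqrt(n)


# Check-list score against instructors' rankings in Class 50-D (N = 18).
print(round(probable_error(0.59, 18), 2))
```

Because the probable error shrinks only with the square root of N, a class of eighteen leaves wide margins around every coefficient, which is the caution the authors raise.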


Summary 


The check list was designed for student 
ratings of one another in seminar situations in 
an effort to minimize common weaknesses of 
rating scales. Techniques employed were: 
using items describing specific behavioral in- 
cidents, using more positive items than nega- 
tive for grading purposes, using discriminating 
items for guidance purposes, requiring the 
student to check only behavior observed rather 
than to rate characteristics, and using buffers to 
reduce subjectivity of responses. 

In the practical situation the check list pro- 
vided data which were considered valuable by 
both students and faculty. The list had face 
validity as well as validity with instructor 
observations. 


Received December 18, 1950. 
Consistency of Interview Methods in Appraisal of Attitudes *
Carl Wedell and Karl U. Smith 


University of Wisconsin 


A main problem of current clinical and in- 
dustrial psychology is one of method relating 
to making attitudinal measurements. Ques- 
tionnaire methods and interview methods are 
recognized procedures of eliciting responses in 
terms of which such measures may be derived. 
In recent years, emphasis has been laid on the 
use of interview-rating techniques as a more 
acceptable method of eliciting responses in 
attitudinal work both in applied and academic 
research. This emphasis seems to have de- 
veloped from notions that the interview 
achieves greater depth of penetration into basic 
attitude patterns and provides a better con- 
trolled method of making judgments about re- 
sponses elicited. It is widely recognized, how- 
ever, that claims concerning general validity 
and reliability of different methods of attitude 
study in various areas of theoretical and ap- 
plied psychology have not been based upon 
concrete investigation. 

In the different fields of attitude measure- 
ment, it is often difficult, if not impossible, to 
obtain comprehensive and relevant estimates 
of the reliability and validity of the methods 
used. Some indirect channels of research, 
however, provide a means of securing informa- 
tion about techniques which is suggestive in ap- 
praising procedures in regard to their relative 
acceptability as instruments for research and 
application. In the present study, an indirect 
approach of this sort has been made in order 


* The data of this experiment were obtained by Dr. 
Wedell before his death in an automobile accident in 
February, 1950, in line of duty in completing this 
investigation. The report has been prepared by Karl 
U. Smith of the Department of Psychology in the form 
that it is believed Dr. Wedell would have presented the 
material. This study was conducted as a part of the 
research program of the Bureau of Industrial Psychol- 
ogy, the University Extension Division, Madison, 
Wisconsin. The study was carried out under a co- 
operative project with the Ansul Chemical Company, 
Marinette, Wisconsin. The Ansul Chemical Company 
supported the project as a part of their systematic 
management and investigation within the company.
A report of this research was presented at the meetings 
of the Midwestern Psychological Association, May 5 
and 6, 1950. 


to compare questionnaire methods and inter- 
view techniques in the appraisal of attitudes. 

This study was set up as a part of a systematic
attitude survey among the employees of a 
small chemical company of about 250 men and 
women. The object of the study was to deter- 
mine the relative consistency of interview rat- 
ings of attitude on certain critical situations 
relative to self-judged attitude of the subjects 
about these same situations. From this com- 
parison, it is possible to gain certain facts about 
the nature of both self-judged attitude and 
interview-judged attitude. 


Methods 


Self-judged attitude level was measured by 
use of a questionnaire booklet containing 55 
five-choice scaled items which covered especi- 
ally company policy, management, supervision, 
associates, job satisfaction, and work condi- 
tions. The booklet was designed to be under- 
stood by a person of average high-school educa- 
tion, to avoid halo effects, and to minimize 
controversial factors outside of specific plant 
relations. Approximately 200 employees filled 
out the questionnaire booklets, in which fur- 
ther comments and qualifying statements were 
encouraged at the bottom of each question. 

Six trained and experienced interviewers 
used patterned and open-ended techniques in 
interviewing those employees who had taken 
the questionnaire. Two of the interviewers 
were trained at the doctorate level, one at the 
level of the Master’s Degree, and three at the 
baccalaureate level. All had had some train- 
ing in counseling bureaus, industry, and guid- 
ance work. Those with advanced training had 
extensive experience in this type of work. 

A general pattern of interview questions was 
followed for each employee, and the interviewer 
wrote the context of each interview either dur- 
ing or after the interview, depending on the 
rapport established with the employee. Inter- 
views were conducted within two weeks from 
the time of filling out the questionnaire. 
At the end of each interview the employee
was asked to give the number of the question- 
naire booklet he had filled out. Only thirteen 
men refused to identify their booklets. After 
the employee had left, the interviewer marked 
six core questions in the booklet in the way he 
thought the employee had answered them. 
The interviewer did not ask for specific self- 
ratings of attitudes on specific questions during 
the interview. All interview ratings were 
made on the same scale that a given item was 
assigned in the questionnaire. 

Rating of the original interview record sheets 
was then carried out by two individuals who 
had not done the interviewing. They rated
three of the six core questions on the basis of 
the information given in the interview record 
sheets for each employee. One of these per- 
sons was trained at the doctorate level and the 
other beyond the baccalaureate level. The 
rater with the lower level of training was ac- 
quainted with the detailed experimental set-up 
and development of this project; the other rater
knew only the general outline. There were
no significant differences between the ratings 
of the two raters, but to rule out any question 
of validity of the ratings, data from the person 
well acquainted with the project will not be 
included in the results. 

Three core questions were thus self-judged 
by each employee, rated by the interviewers, 
and objectively rated from observation of the 
original interview record sheets by the trained 
raters. These three core questions were: 


1. How do you feel this company is run?

2. My usual job is ______ (statement of level
of job satisfaction).

3. My foreman is ______ (attitude toward
supervision).


For purposes of analysis the company was 
divided into four groups on the basis of divi- 
sions in the company. Attitude level differed 
in these four divisions. In one group, attitude 
was very high, in another it was low, and the 
other two had intermediate attitude levels. 


Results 


Three main results were obtained in the
experiment.


Fig. 1. Comparison of the average level of employee
attitude toward three core questions as measured by
the questionnaire and the interviewers. (Abscissa:
core questions. Legend: questionnaire, interview;
an asterisk marks a significant difference.)


Interviewers overestimate attitude in comparison
with self-judged attitude.¹ Significant
overall differences were found for two core 
questions. These results are illustrated in 
Figure 1, which summarizes the differences be- 
tween the average employee self-rated level of 
attitude toward three of the six core questions 
as taken from the questionnaire, and the aver- 
age of the six interviewers’ estimates of indi- 
vidual employee attitude toward the same 
three core questions. The scale has been 
changed from the original five-choice scaled 
answer to a corresponding 100-point scale for 
easier reading. The figure shows the upper 
half of the 100-point scale along the ordinate. 
As can be seen by observation of Figure 1, the 
greatest differences between the two measures 
are found in the two questions dealing with the 
company and with job satisfaction. These 
differences are statistically significant at the 5 
per cent level or less. The third question 
about attitude toward foremen does not show 
a significant difference between the two meas- 
ures. On the three additional core questions, 
in terms of which interview ratings and self- 
judged ratings could be obtained, significant 
differences were found for one of the three 
questions. 

Objective rating of interview record sheets 
is closer to self-rating (questionnaire) than the 
interviewers’ rating. Significant differences 
between questionnaire and record-ratings were 
found for only one of the three core questions. 


¹ Application of χ² to test the significance of a net
change or difference (2).
These results are presented in Figure 2 which
shows the differences between the average em- 
ployee self-rated level of attitude toward the 
core questions as taken from the questionnaire, 
and the average of the objective ratings based 
on the original interview-record sheets. The 
scale again has been changed from a five-point 
to a 100-point scale. Observation of Figure 2 
shows that significant differences between these 
two measures appear in question 1 dealing with 
the attitude toward the company. Questions 
2 and 3 do not show significantly different levels 
of attitude estimation. 

A third major result found is that interview- 
ers vary significantly among themselves in the 
degree of correspondence with the self-ratings 
of the questionnaire. Data concerning this 
point are summarized in Table 1. In this 
table the number of the interviewer is indicated 
to the left. The nature of the different core 
questions is described at the top of the table. 
The notations in the body of the table represent 
the instances in which interview-judged atti- 
tude level showed no significant difference from 
self-judged attitude level (NS) and instances 
in which significant and very significant differ- 
ences were found between these two judgments. 

Interviewers 5 and 6 were the most exten- 
sively trained of all the interviewers. Both 
also had had extensive practical and teaching 
experience in interviewing in clinical and voca- 
tional types of work. Interviewer 1 was a 
student of clinical psychology with the Master’s 


Fig. 2. Comparison of the average level of employee
attitude as measured by the questionnaire and the
interview-record ratings. (Abscissa: core questions
Company, Job, Foreman. Significant differences
are marked.)




Table 1


Significance of the Differences Between the Ratings of
Interviewer and the Self-ratings of the Questionnaire
Dealing with the Three Core Questions


Interviewer    1. Company     2. Job         3. Foreman
1.             [illegible]    [illegible]    [illegible]
2.             [illegible]    NS             NS
3.             [illegible]    NS             NS
4.             NS             NS             NS
5.             [illegible]    [illegible]    [illegible]
6.             * NS (two entries legible; columns uncertain)


** Significant at the 1% level.
* Significant at the 5% level.
NS No significant difference.


degree and had had two years of practical and 
research work in prison and vocational work. 
The ratings of these three men varied most 
from the self-ratings. Interviewers 2, 3, and 
4 were relatively untrained in any advanced 
aspect of vocational, industrial, or clinical 
psychology. All had only the baccalaureate 
degree. Interviewer 3 was relatively inex- 
perienced in work of these various natures. 
Interviewer 2 was primarily trained as a soci- 
ologist and had had over one year of vocational 
advisement experience. Interviewer 4 pos- 
sessed the baccalaureate degree, lacked formal 
qualifications for advanced study, and had had 
about one year’s experience as an aide in a vet- 
eran’s counseling agency. These three men 
were far more consistent among themselves 
than the other three interviewers, and, in addi- 
tion, showed only two instances of discrepancy 
with the self-ratings of the employees.

When the data were first examined, it was 
thought that the more experienced interviewers 
might be getting a more insightful appraisal of 
attitude in the employees than the subjects 
themselves could give on the questionnaire. 
Accordingly, the experienced interviewers 
would show more frequent discrepancies from 
the self-rating. Breakdown of the data by 
interviewers and by departments shows that 
this is not the case. The nature of interviewer 
discrepancy is always toward the mean of the 
self-ratings. When the mean self-rating for a
department is relatively low in comparison
with the mean of all ratings, the interview ratings
overestimate the self-judged attitude level.
In one department, in which attitude-level 
was very high, interview-ratings underestimate 
the level of the self-ratings on two of the three 
questions. Individual interview-ratings ex- 
aggerate these trends in the cases of different 
interviewers showing significant discrepancies 
from the self-ratings. 

Another point concerning the nature of dis- 
crepancies in the interview-ratings may be sug- 
gested. The most frequent discrepancies be- 
tween interview-rating and self-rating are 
found on the question about the company. 
Less frequent bias in interview-rating is found 
for the questions about “job” and “foreman.” 

Specific research was conducted to determine 
the effect of the nature of the question or issue 
upon the interview ratings. The issues of 
three additional core questions in the booklet 
were also rated by the interviewers. These 
questions were: 


4. Does your job give you a feeling that you
are doing something really worthwhile?

5. Do you think that the rate your job pays
is fair compared with other jobs?

6. Considering my skill, my experience, and
my training, the job I now have . . . (is
satisfactory, does not make use of my
ability, calls for training and knowledge
that I don’t have).


These three questions required a rating of
only three intensity-levels, instead of five, as in
the first three questions. In addition, these
three questions were judged to be personally
oriented, rather than company or supervisor
oriented, and based on concrete experience and
satisfaction rather than general attitudes.
The frequency of interviewer discrepancies
with respect to the self-ratings on these three
questions is given in Table 2. There are four
very significant interviewer discrepancies on 
these three questions, and a total of six sig- 
nificant discrepancies, as compared to seven 
very significant discrepancies and a total of ten 
significant discrepancies on the first three ques- 
tions. On these issues, as in the first case, the 
bias of the interviewer is generally such that an 
overestimation of the level of self-judged atti- 
tude occurs. 


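Table 2

Interviewer Discrepancies from Self-ratings on
Concrete Issues

                                   Questions
              4. Job is Worth-     5. Fair    6. Job and
Interviewer   while Activity       Pay        Training
1             NS                   [illeg.]   NS
2             [illeg.]             [illeg.]   NS
3             NS                   NS         NS
4             NS                   NS         NS
5             [one entry survives, "NS"; column position illegible]
6             [two entries survive, "NS" and "NS"; column positions illegible]

** Significant at the 1% level.
* Significant at the 5% level.
NS No significant difference.
[illeg.] Entry illegible in the source copy.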


It is interesting to note that two of the three 
interviewers showing the greatest consistency 
with respect to the self-ratings on issues 1, 2, 
and 3 also show the least bias on questions 
4, 5, and 6. These two men were the least 
trained and experienced of the group of inter- 
viewers. One of the most inconsistent raters 
on the last three questions, interviewer 5, was 
one of the most highly trained and experienced 
among the group of interviewers. He was 
also one of the two interviewers showing the 
greatest discrepancy from the self-ratings on 
issues 1, 2, and 3. 


Discussion and Conclusions 


This research was conducted to investigate 
the question of consistency of interview-ratings 
among trained and partly-trained interviewers. 
The set-up of the experiment permitted in-
direct appraisal of the consistency of interview
ratings of attitude with respect to self-ratings 
of the attitude on the same issues and questions. 
The self-ratings were obtained by means of 
arbitrarily scaled questionnaire items. 

Results showed that six major features were 
typical of the interview-ratings when they were 
compared to the self-ratings of attitude. 


1. Interview-ratings show frequent and 
marked discrepancies from self-ratings, the di- 
rection of which is generally toward the mean 
of the self-ratings. The most typical discrep- 
ancy is for interviewers to overestimate the 
level of relatively low self-ratings. When 
self-ratings are high, however, interviewers
are most likely to underestimate the self-rating 
level. 

2. Interviewers are inconsistent among 
themselves in estimating attitude-level as indi- 
cated in self-ratings. That is to say, different 
interviewers not only vary with respect to 
the self-ratings but differ markedly from each 
other. 

3. Interview-ratings among highly-trained 
and experienced interviewers show more fre- 
quent and more serious discrepancies from self- 
ratings than do the ratings of relatively un- 
trained and inexperienced college graduates. 
The discrepancies noted cannot be attributed 
to greater “depth” of psychological analysis 
on the part of the experienced interviewers 
inasmuch as their ratings exaggerate the 
tendency of making “mean-level” ratings.

4. Interview-ratings vary from self-ratings 
as a function of the issue or question rated. 
Interview ratings on concrete personal issues 
gave somewhat fewer discrepancies than rat- 
ings obtained with general issues of satisfaction, 
company attitude, and supervision. Of the 
issues studied, the question concerning attitude 
toward company gave the greatest amount of 
interviewer discrepancy with self-ratings. 

5. When interviewer discrepancy from self- 
rating is studied with concrete questions de- 
manding only three intensity-levels of rating, 
almost as many discrepancies between the two 
types of rating occur as are found with general 
questions demanding rating at five intensity- 
levels. 

6. The greater accuracy of inexperienced
and untrained interviewers in estimating the
level of self-ratings, and the relative inaccuracy
of highly trained and experienced psychologists
in making such estimates, occur more or less
consistently both with general questions requiring
ratings at five intensity-levels and with concrete
questions requiring ratings at three
intensity-levels.




The estimation of attitudes by means of ob- 
jective devices and interview observations is 
crucial for almost all fields of general and ap-
plied psychology. The expanding use of inter-
view-ratings of attitude in the study of per-
sonality, marketing problems, job analysis,
public opinion, industrial placement, and in 
clinical problems seems to have developed as a 
part of the notion that greater analytic per- 
spective and depth can be achieved by face-to- 
face techniques guided by the trained and ex- 
perienced psychologist. By and large, this 
study seems to show that methods of individual 
interview-rating are highly inconsistent and 
tend toward superficiality rather than analytic 
“depth,” and possess certain general features 
of variation which are uncontrollable. A par- 
ticularly discouraging aspect of such variation 
is the fact that professional training seems to 
enhance its degree. 

Since the results of this research were as- 
sembled and presented at the Midwestern 
Psychological Association in May, 1950, a 
study on the validity of interviewers’ ratings 
in prediction of individual success in clinical 
psychological training has been reported (1). 
The results of this work, as reported by Kelly 
and Fiske, seem to agree in detail with the ob- 
servations made here on the nature of ratings 
made by experienced and inexperienced inter- 
viewers. 

Relative to the conduct of investigations of 
attitudes, this research points toward increased 
emphasis upon the development of objective 
scaling devices. 


Received December 20, 1950. 
Group Size and Leaderless Discussions *


Bernard M. Bass and Fay-Tyler M. Norton 
Louisiana State University 


The leaderless group discussion is used to 
assess candidates for leadership positions. As- 
sessors observe but do not participate in a dis- 
cussion carried on by a group of candidates. 

The number of candidates assessed as a 
group has varied considerably. Among 44 
civil service agencies reporting use of the
leaderless discussion technique, the number of 
applicants examined at one time varied from 3 
to 10. Most frequently reported in use by 
these agencies were groups of 8 (5). Eight 
examinees per group was also favored by the 
British officer selection board (6). Carter (4) 
has studied groups of 2 and 4. Experimental 
studies have been made by the senior author 
and others (1, 2, 3, 7) using 4 to 10 candidates
per group. 

In these last cited studies, it has been found 
impractical, expensive, and sometimes im- 
possible to organize a large number of discus- 
sion groups and keep constant the size of the 
groups. If the leaderless group technique is to 
be used in industry and government, it will 
need to be capable of operation with groups 
varying in size. 

Purpose 

The purpose of this study was to investigate 
the effects of variations in group size of initi- 
ally leaderless discussions on: (1) the mean 
leadership rating attained by participants on 
supposedly “absolute” rating scales; (2) the 
extent of stratification which developed, as 
measured by the variance in leadership ratings 
attained by participants; (3) the extent of 
agreement among raters; and (4) the consist- 
ency of participant behavior. 


Method 


One hundred and twenty college students 
drawn from undergraduate psychology courses 


* This study was aided by a grant from the Graduate 
Council on Research of Louisiana State University. 
A paper concerning this study was read at the Forty- 
third Annual Meeting of the Southern Society for 
Philosophy and Psychology, Roanoke, Va., March 23, 
1951. 
were divided into 5 aggregates of 24 each. One
aggregate was split into 12 groups of 2 ex- 
aminees. A second aggregate of 24 was split 
into 6 groups of 4; a third aggregate was split 
into 4 groups of 6; a fourth aggregate was split 
into 3 groups of 8, and a fifth aggregate was 
split into 2 groups of 12. 
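
The arithmetic of this design can be checked with a short sketch. The constant and function names below are illustrative assumptions and do not come from the study; only the aggregate size of 24 and the five group sizes are taken from the text.

```python
# Aggregate size and group sizes as stated in the Method section.
AGGREGATE_SIZE = 24
GROUP_SIZES = [2, 4, 6, 8, 12]


def groups_per_aggregate(aggregate_size, group_size):
    """Return how many equal-sized groups one aggregate splits into."""
    if aggregate_size % group_size != 0:
        raise ValueError("aggregate does not divide evenly into groups")
    return aggregate_size // group_size


# Map each group size to the number of groups formed from one aggregate.
design = {size: groups_per_aggregate(AGGREGATE_SIZE, size) for size in GROUP_SIZES}
total_groups = sum(design.values())
total_students = AGGREGATE_SIZE * len(GROUP_SIZES)

print(design)
print(total_groups, total_students)
```

The totals agree with the 120 students and 27 discussion groups reported in the text.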

The 27 groups, varying in size from 2 to 12,
each participated in a 30-minute discussion.
The discussion and procedure have
been outlined elsewhere (1). Topics for dis- 
cussion were drawn from course work. A week 
later, the same groups were retested in repeat 
discussions using new but similar problems for 
discussion concerning psychology course ma- 
terials. 

Two trained observers made absolute judg-
ments of each discussion participant and as-
signed points by indicating whether the par-
ticipant showed a designated item of behavior
a great deal, 4 points; fairly much, 3 points; to
some degree, 2 points; comparatively little, 1
point; and not at all, 0 points.

The 9 items of behavior² upon which such
judgments were made were those found most 
valid for identifying potential college leaders 
in a previous study (3). These items had an 
average intercorrelation before refinement of 
.85. A total rating based on these assignments 
of points could vary from 0 to 36. This total 
rating was used as an absolute index of leader- 
ship behavior in discussion. 
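
As a minimal sketch of how such a total might be tallied, the following assumes each of the 9 items is scored 0 to 4 as described above. The function name and the sample ratings are hypothetical and are not data from the study.

```python
N_ITEMS = 9               # number of behavior items rated per participant
MAX_POINTS_PER_ITEM = 4   # "a great deal" = 4 points ... "not at all" = 0 points


def total_rating(item_points):
    """Sum one observer's item ratings for one participant (range 0 to 36)."""
    if len(item_points) != N_ITEMS:
        raise ValueError("expected one rating for each of the 9 items")
    for points in item_points:
        if points not in range(MAX_POINTS_PER_ITEM + 1):
            raise ValueError("each item must be rated 0, 1, 2, 3, or 4")
    return sum(item_points)


# Hypothetical ratings for one participant, for illustration only.
example_ratings = [3, 2, 4, 1, 2, 3, 2, 4, 3]
print(total_rating(example_ratings))
```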


¹ The aggregates and groups were organized as ran-
domly as possible except for one restriction. All the
examinees of any particular group had to come from
one particular psychology course because of practical
considerations. This probably served to reduce slightly
the variance of ratings within groups, the standard
errors of group means and the reliabilities of ratings.

² The nine items were: 1. He was effective in saying
what he wanted to say; 2. He offered good solutions
to the problem discussed; 3. He showed initiative; 4.
He clearly defined the problems . . .; 5. He motivated
others to participate in the discussion; 6. He led the
discussion; 7. He influenced the participants . . .; 8.
He seemed interested in the discussion; 9. He knew
about the topic discussed.
Results


Effect of Group Size on the Mean Rating As- 
signed. Section A of Table 1 shows the mean 
total points received by each of the 5 aggre- 
gates of participants tested in discussion groups 
of varying size as assigned by each observer on 
the original discussion and on the retest. 
Analysis of variance of the composite means for 
each aggregate, aggregate by aggregate, for 
the test and retest, yielded an F ratio³ which
when converted to epsilon, resulted in a co- 
efficient of .91. Thus, when considering can- 
didates examined in discussion groups varying 
in size from 2 to 12, 83 per cent of the variance 
among candidates in assigned ratings could be 
accounted for by the size of the group in which 
they were examined. A man examined in a 
2-man group was likely to earn a leadership 
assessment, supposedly absolute, twice that 
earned by a man examined in a 12-man group, 
other things being equal. It seems that op- 
portunity to adopt leadership functions in a 
group decreases directly with the number of 
members in that group. It is inferred that 
proper correction must be made in any leader- 
less discussion studies where the different per- 
sons have been examined in groups varying in 
size. 
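
The step from a coefficient of .91 to 83 per cent of variance is a squaring of epsilon. The sketch below shows that step, together with one common textbook form of epsilon-squared computed from a one-way F ratio. The report gives neither the F ratio nor its degrees of freedom, so the values passed to `epsilon_squared` here are hypothetical, and the formula is an assumption about the conversion rather than the authors' documented procedure.

```python
import math


def epsilon_squared(f_ratio, df_between, df_within):
    """Bias-adjusted correlation ratio (epsilon-squared) from a one-way F ratio.

    Uses the textbook identity
        eps^2 = df_b * (F - 1) / (df_b * F + df_w).
    """
    return df_between * (f_ratio - 1.0) / (df_between * f_ratio + df_within)


def variance_accounted_for(epsilon):
    """Proportion of variance associated with the grouping factor."""
    return epsilon ** 2


# Reported coefficient: .91 squared is about .83, the figure given in the text.
print(round(variance_accounted_for(0.91), 2))

# Hypothetical F ratio and degrees of freedom, for illustration only.
eps_sq = epsilon_squared(f_ratio=5.0, df_between=4, df_within=115)
print(round(eps_sq, 4), round(math.sqrt(eps_sq), 4))
```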


³ As shown in Section B of Table 1, the variances of
ratings were somewhat heterogeneous. They tended 
to increase as the means decreased. The resulting F 
ratio was probably a conservative estimate. 




Table 1

Means and Standard Deviations of Leadership Ratings Attained by Participants Assessed in
Initially Leaderless Discussion Groups Varying in Size

                                          Group Size
                                  2        4       6       8      12
A. Mean Rating Assigned
   Test, Observer A           [missing]   23.7    19.3    20.6    14.4
   Test, Observer B             21.3      16.7    16.7    12.7     9.8
   Retest, Observer A           23.2      25.1    20.4    17.1    12.0
   Retest, Observer B           23.3      22.5    18.1    14.5    10.4
   Average                      23.2      22.0    18.6    16.2    11.6
B. Standard Deviation of
   Assigned Ratings
   Test, Observer A              7.8       9.8    12.4    12.8    11.0
   Test, Observer B              7.8      10.3    11.8    11.7    10.4
   Retest, Observer A            6.7       9.4    10.8    10.2    11.1
   Retest, Observer B            6.7       9.3    11.8     9.2    10.4
   Average                    [missing in the source copy]


Effect of Group Size on Extent of Stratification.
Table 1, Section B, shows the changes of
standard deviation of total ratings assigned
that occurred as a function of change in
discussion group size. Bartlett’s test for
homogeneity of variance yielded chi square values
which were not significant at the 5 per cent
level for test or retest standard deviations.
There was a tendency for absolute variation
to be at a maximum with discussion groups of
6. If the inverse relationship between means
and standard deviations is considered, it
appears that relative variance of leadership
ratings tended to increase with increase in
discussion group size.

Effect of Group Size on Observer Agreement.
Each correlation of the top half of Section A of
Table 2 indicates the extent of agreement
between two observers, each rating a total of 24
subjects who participated in discussion groups
of a designated size. Similarly derived
correlations are shown for the retest discussions
held a week later.

Maximum agreement was reached when 6
participants per group were assessed.

The 10 correlations reported in Table 2,
Section A, representing the agreement between
observers A and B rating the same participants,
for each group size, in test and retest, were
subjected to a three component analysis of
variance.⁴ The total variance was divided into
the variance associated with group size (4
d.f.), the variance associated with test and
retest (1 d.f.),⁵ and the remainder (4 d.f.).
Interaction could not be tested for significance
since there was only 1 r per cell.

The increase of average agreement among
observers from .82 to .90 from the first to the
second discussions yielded an F ratio of 2.2
that would be accounted for by chance
fluctuations. The variation of correlations among
groups of different size yielded an F ratio of
5.03 which also was not significant at the 5 per
cent level of confidence. However, the
patterns of change in observer agreement,
especially for the original discussion, were interesting
enough to suggest collection of more data
before making any conclusive inferences.


⁴ Before these correlations (and those reported later)
were subjected to the analysis, they each were
converted to Fisher’s z to normalize the data and make
homogeneous their variance.

⁵ For the correlations of Section B of Table 2, a
source of variance with 1 d.f. due to observers was
substituted for this particular source in Section A.
For correlations of Sections C and D, a source due to a
combination of observers and test order with 1 d.f. was
substituted.


Table 2

The Effects of Group Size Upon Various Estimates of Reliability of Ratings and Behavior in
Initially Leaderless Discussions (N = 120)

                                                             Group Size
Reliability Estimate                              2        4        6        8       12      Av.
A. Agreement between Observers A and B
   rating the same participants
     Test                                        .78      .80      .89      .84      .74     .82
     Retest                                      .72      .90      .94      .89      .93     .90
     Average                                     .75      .86      .93      .87      .86     .86
B. Consistency of participant behavior:
   each observer’s test ratings correlated
   with his own retest ratings of same
   participants
     Observer A                                  .46      .70      .92      .57      .80     .74
     Observer B                                  .42      .81      .85      .58      .86     .74
     Average                                     .44      .76      .89      .58      .83     .74
C. Consistency of participant behavior:
   Observer A’s test ratings correlated
   with Observer B’s retest ratings and
   vice versa
     A test-B retest                          [illeg.]  [illeg.]  [illeg.]   .65      .83     .71
     A retest-B test                             .35      .86      .83      .53      .81     .72
     Average                                     .34      .78      .85      .59      .82     .72
D. Correlations of Section C corrected for
   lack of agreement among observers
     A test-B retest                             .45      .80      .95      .74      .99     .89
     A retest-B test                             .47     1.00      .90      .60      .97     .91
     Average                                     .46      .97      .93      .68      .98     .90
Number of groups per aggregate                   12        6        4        3        2
Number of cases per aggregate                    24       24       24       24       24

[illeg.] Entry illegible in the source copy.


The Effect of Group Size on the Consistency
of Participant Behavior. Sections B, C, and
D of Table 2 show the effect of group size on 
estimates of behavioral consistency of partici- 
pants in the two leaderless discussions in which 
each was a member. The reliability coeffi- 
cients of Section B are based on each observer’s 
ratings of participants in the test discussion 
correlated with the observer’s own ratings of 
the same participants for the retest discussion. 

To reduce any rating consistency from test 
to retest that might be attributed to rater bias 
because of memory, the ratings of participants 
in the test discussion by one observer were cor- 
related with the retest ratings of the same par- 
ticipants by the second observer. The reverse 
was also done. The resulting reliability co- 
efficients are displayed in Section C. 

The estimates of behavior consistency in 
Section C were attenuated by the less-than- 
perfect agreement among the two observers in 
their rating procedure as shown in Section A 
of Table 2. The coefficients of Section C were 
corrected for this lack of agreement by use of 
the appropriate formula for correction for at-
tenuation⁶ and are shown in Section D. The
reliability coefficients of Section D are esti- 
mates of the consistency of participant be- 
havior from the one discussion to a second dis- 
cussion among the same participants a week 
later where bias due to observer memory has 
been reduced and where the measurements of 
behavior have been made perfectly reliable by 
statistical manipulation. 
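
The correction described here can be illustrated with the standard correction-for-attenuation formula, in which an observed correlation is divided by the square root of the product of the two reliabilities. The sketch below assumes that the test and retest observer-agreement coefficients of Section A serve as the two reliabilities, and that corrected values are capped at 1.00. Both points are inferences from the pattern of entries in Table 2, not statements made in the report.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy, cap=1.0):
    """Estimate the correlation between two measures freed of unreliability.

    r_xy: observed correlation between the two sets of ratings
    r_xx, r_yy: reliability (here, observer agreement) of each set
    cap: upper bound applied because sampling error can push the ratio above 1
    """
    if r_xx <= 0 or r_yy <= 0:
        raise ValueError("reliabilities must be positive")
    return min(r_xy / math.sqrt(r_xx * r_yy), cap)


# 2-man groups, as read from Table 2: A retest-B test correlation .35,
# observer agreement .78 on the test and .72 on the retest.
corrected = correct_for_attenuation(0.35, 0.78, 0.72)
print(round(corrected, 2))
```

Under these assumptions the sketch gives about .47 for the 2-man groups, the value printed in Section D for that cell.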

A three component analysis of variance of 
the data of Section B yielded a chance F ratio 
of .06 for the source of variance attributable 
to observers and an F ratio of 10.4 for the vari- 
ance among coefficients attributable to vari- 
ations in group size. The latter F ratio was 
significant at the 5 per cent level of confidence. 

A similar three component analysis of vari- 
ance of the data of Section D yielded an F ratio 
of .02 for observer variance and an F ratio of 
2.5 for the effects of group size both of which 
lacked significance at the 5 per cent level. 
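
The correlations entering these analyses of variance were first converted to Fisher’s z, as the footnote on the analysis of observer agreement states. A minimal sketch of that transformation and its inverse follows. The function names are illustrative, and the sample coefficients are the Section A test-discussion agreement values as read from Table 2.

```python
import math


def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)  # equivalent to 0.5 * ln((1 + r) / (1 - r))


def inverse_fisher_z(z):
    """Convert a Fisher z value back to a correlation coefficient."""
    return math.tanh(z)


# Observer agreement on the test discussion for groups of 2, 4, 6, 8, and 12.
test_agreement = [0.78, 0.80, 0.89, 0.84, 0.74]
z_values = [fisher_z(r) for r in test_agreement]
mean_z = sum(z_values) / len(z_values)

# Averaging in z units and converting back need not equal the simple mean of r.
print([round(z, 3) for z in z_values])
print(round(inverse_fisher_z(mean_z), 3))
```

Working in z units gives coefficients near 1.00 more room to differ than the raw scale does, which is one reason the transformation is used before an analysis of variance.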

Although the consistency of participants’ 
leadership behavior tended to vary with group 
size, the small number of coefficients available 
for analysis made rejection of the null hypothe- 
sis difficult. More data are needed, especially 
for groups of 2. It would not be unreasonable 
to suggest that participant leadership behavior 
in 2-man group discussion situations may be 
less consistent than for groups of larger size. 


If this inference is borne out, it may have
ramifications for those studying initially leader-
less groups of this size and for those working 
with the traditional interview. 


⁶ See Peters, C. C., and Van Voorhis, W. R. Statistical
procedures and their mathematical bases. New
York: McGraw-Hill, 1940, pp. 203.


Summary 


When leaderless discussion participants were 
studied in groups of 2, 4, 6, 8, and 12, there was 
a significant decline in the mean leadership 
assessment earned by participants as the 
groups studied became larger in size. 

Maximum stratification in the absolute sense 
occurred in discussion groups of 6. Relative 
stratification tended to increase directly with 
increases in discussion group size. 

Observer agreement also reached a maximum 
with discussion groups of 6 and tended to de- 
cline as group size was altered in either di- 
rection. 

Consistency of leadership behavior was at a 
minimum in discussion groups of 2. Beyond 
this point, no systematic trends were clearly 
discernible for behavioral consistency in rela- 
tion to group size. 
Expressed and Inventoried Interests of Veterans *


Manuel N. Brown 
Veterans Administration Hospital, Vancouver, Washington 


The purpose of this research was to deter- 
mine the relationship between the expressed 
and inventoried interests of veteran patients, 
to use Super’s terminology (14). That is, to 
learn how the patients’ self-estimates of their 
preferences corresponded to their measured 
ratings, employing the Lee-Thorpe Occupa- 
tional Interest Inventory (11). 

No study has been found in the literature 
which relates expressed interests to measured 
preferences, using the Lee-Thorpe. In fact, 
there is a striking lack of reported research on 
the Lee-Thorpe, even though the test is used 
widely. Although published in 1943, there are 
only five studies on this test reported in the 
journals. A review of the literature is there- 
fore limited to work done with the Kuder Pref- 
erence Record and the Strong Vocational 
Interest Blank. 

On the Kuder, studies comparing self-esti- 
mates to scored interests have been done by 
Berdie (2), Bordin (3, 4), Crosby and Winsor 
(7), DiMichael (9), Kopp and Tussing (10), 
and Rose (13). These studies cover high 
school pupils and college students of both 
sexes, vocational rehabilitation counselors, 
and war veterans. While a very wide range of 
relationship appears in dealing with individual 
scales or interest categories, the median cor- 
relations reported for the test as a whole range 
from .40 to .64. The point made here is that, 
with whatever group of subjects, expressed 
interests are but moderately related to in-
ventoried scores on the Kuder.

Similar research has been done on the 
Strong Blank by Bedell (1), Berdie (2), Bordin 
(3, 4), Darley (8), and Moffie (12). Con-
sistently, somewhat lower correlations were
found than those reported above on the Kuder. 


* Reviewed in the Veterans Administration and pub- 
lished with the approval of the Chief Medical Director. 
The statements and conclusions published by the author 
are the result of his own study and do not necessarily 
reflect the opinion or policy of the Veterans Adminis- 
tration. 
While the relationship between professed and
measured preferences seems at best to be 
moderate, acquiescence with inventoried scores 
by subjects is probably much higher. Brown 
(5, 6) secured the reactions of veteran patients 
to their interest ratings, when their Kuder or 
Lee-Thorpe scores were discussed with them in 
the regular process of counseling. Using an 
index called the dissent score, client agreement 
with inventoried interests proved to be statisti- 
cally significant beyond the .01 level of con- 
fidence, employing the Chi-square method of 
evaluation. 


Subjects 


This report covers a study of 65 male pa- 
tients of a general VA hospital who had taken 
the Lee-Thorpe Occupational Interest Inven- 
tory in the regular course of their vocational 
advisement. They were all veterans of World 
War II, applicants for benefits under Public 
Laws 16 or 346, and were consecutive counsel- 
ing cases. The patients ranged in age from 
20 to 43 years, with a mean age of 29. Their 
schooling averaged the 11th grade, with a 7 to 
16 grade range. In intelligence, the average 
IQ was 105, ranging from 80 to 130 on the 
Otis S-A Higher A test. 


Procedure 


Each of these patients was asked to do the 
following: Before working on the Lee-Thorpe 
Inventory proper, he was to study the six oc-
cupational areas and their subdivisions as
listed on page two of the test booklet. In 
front of each area described, he was to place a 
number from 1 to 6, indicating how his inter- 
ests ranked in his own estimation. He was 
then to proceed with the regular part of the 
Inventory. 

As the self-estimates were not translatable 
into percentiles to be correlated by the product- 
moment method with the actual percentile 
ratings, rank-order correlation was adopted. 
The scale showing the highest percentile score 
ranked number 1, etc. For each subject, a
Rho value was computed as an index of the 
relationship between his self-ratings and in-
ventoried ratings.¹


These individual Rho’s were then arranged
in order of rank, and correlated by the rank-
order method with age, education, and IQ
respectively. N equalled 65 for correlations 
with age and education, but only 55 for IQ. 
Only cases in which an Otis S-A Higher A 
examination had been administered were con- 
sidered, as IQ’s derived from varied tests may 
not safely be grouped for correlation purposes. 
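
The per-subject computation described in this section can be sketched as follows. The sketch assumes the usual rank-difference formula for Rho with no tied ranks. The helper names and the percentile scores are hypothetical and are not taken from any patient's record.

```python
def ranks_from_scores(scores):
    """Rank scores so that the highest score receives rank 1 (no ties assumed)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position + 1
    return ranks


def spearman_rho(ranks_a, ranks_b):
    """Rank-difference correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both rankings must cover the same fields")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("at least two fields are needed")
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * sum_d2 / (n * (n * n - 1))


# Hypothetical subject: self-estimated ranks for the six fields, and
# hypothetical inventory percentiles for the same six fields.
self_ranks = [2, 4, 1, 6, 3, 5]
percentiles = [90, 40, 75, 10, 60, 30]
inventory_ranks = ranks_from_scores(percentiles)

print(inventory_ranks)
print(round(spearman_rho(self_ranks, inventory_ranks), 3))
```

With only six fields per subject, each Rho rests on four degrees of freedom, so individual values can swing widely.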


Results 


Rho’s for individual patients ranged from
−.35 to .99, with a median of .62 for the group.
With four degrees of freedom, fifteen of these 
correlations proved significant at the .05 level, 
and six at the .01 level. 

Individual Rho’s, as an index of the patient’s 
ability to estimate his measured interests, cor- 
related with three variables as follows: Age, 
.02; education, .11; IQ, −.11. With only 63
and 53 df, none of these correlations is statisti- 
cally significant. 
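
One way to see why coefficients of this size fall short of significance is the t approximation for a correlation, t = r * sqrt(df / (1 - r^2)). The sketch below applies it to the three reported values. The use of this approximation for rank-order coefficients, and the rounded two-tailed critical value of about 2.0 at the .05 level for roughly 50 to 60 degrees of freedom, are assumptions made for illustration and are not the author's stated procedure.

```python
import math


def t_for_correlation(r, df):
    """t statistic for testing a correlation against zero, with df = N - 2."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(df / (1.0 - r * r))


# Approximate two-tailed .05 critical value near 50 to 60 d.f. (assumption).
APPROX_CRITICAL_T = 2.0

# Reported correlations of individual Rho's with age, education, and IQ.
reported = {"age": (0.02, 63), "education": (0.11, 63), "IQ": (-0.11, 53)}

for name, (r, df) in reported.items():
    t = t_for_correlation(r, df)
    print(name, round(t, 2), abs(t) >= APPROX_CRITICAL_T)
```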

The results of this study conform closely to 
those of Rose (13) for the Kuder. The latter’s 
group of 60 veterans revealed Rho’s ranging 
from −.05 to .99, with a median of .64.

Crosby and Winsor (7) reported that the 
more intelligent subjects of their group (222 
college students of both sexes), as measured by 
the ACE, came closer than the less intelligent 
to predicting their inventoried interests on the 
Kuder. No such relationship was found in the
present investigation. 


Summary 


1. Sixty-five World War II male veteran 
patients of a general VA hospital gave esti- 
mates as to how they ranked in interest in the 
6 fields of interest of the Lee-Thorpe Interest 
Inventory. These self-ratings were correlated 
with the ranked ratings on the test, and the indi- 
vidual indices derived were further correlated 
with age, education, and IQ. 


¹ Use of individual Rho’s has been made by Wesley,
Corey, and Stewart (15), in their study of intra-
individual relations between interest and ability,
and by Rose (13).




2. The subjects varied extremely in ability 
to predict their inventoried scores. This abil- 
ity, or self-insight, did not correlate signifi- 
cantly with age, education, or intelligence. 

3. Expressed interests agree too little with 
inventoried interests to be used without pref- 
erence tests in counseling or selection. 


Received February 8, 1951. 


References 


1. Bedell, R. The relationship between self-estimates 
and measured vocational interests. J. appl. 
Psychol., 1941, 25, 59-66. 

2. Berdie, R. F. Scores on the Strong Vocational 
Interest Blank and the Kuder Preference Record 
in relation to self ratings. J. appl. Psychol., 
1950, 34, 42-49. 

3. Bordin, E. S. A theory of vocational interests as 
dynamic phenomena. Educ. psychol. Measmt., 
1943, 3, 49-66. 

4. Bordin, E. S. Relative correspondence of pro-
fessed interests to Kuder and Strong test scores.
Amer. Psychologist, 1947, 2, 293.

5. Brown, M. N. Client evaluation of Kuder ratings.
Occupations, 1950, 28, 225-229.

6. Brown, M. N. Evaluation of Lee-Thorpe Inven- 
tory ratings by veteran patients. Educ. psychol. 
Measmt., 1951, Summer. Scheduled for publi- 
cation. 

7. Crosby, R. C., and Winsor, A. L. The validity of 
student estimates of their interests. J. appl. 
Psychol., 1941, 25, 408-414. 

8. Darley, J. G. Clinical aspects and interpretation of 
the Strong Vocational Interest Blank. New York: 
Psychological Corporation, 1941. 

9. DiMichael, S. G. The professed and meas-
ured interests of vocational rehabilitation coun-
selors. Educ. psychol. Measmt., 1949, 9, 59-72.

10. Kopp, T., and Tussing, L. The vocational choices 
of high school students as related to scores on 
vocational interest inventories. Occupations, 
1947, 25, 334-339. 

11. Lee, E. A., and Thorpe, L. P. Manual of
directions: Occupational Interest Inventory, advanced
series. Los Angeles: California Test Bureau, 
1943. 

12. Moffie, D. J. The validity of self-estimated inter- 
ests. J. appl. Psychol., 1942, 26, 606-613. 

13. Rose, W. A. A comparison of relative interest in 
occupational groupings and activity interests as 
measured by the Kuder Preference Record. 
Occupations, 1948, 26, 302-307. 

14. Super, D. E. Appraising vocational fitness. 

_ York: Harper and Brothers, 1949. 

15. Wesley, S. M., Corey, D., and Stewart, B. A study
of the intra-individual relations between interest 
and ability. J. appl. Psychol., 1950, 34, 193- 
197. 




Limitations of the Bernreuter Personality Inventory 
in Selection of Supervisors 


Charles P. Sparks 
Richardson, Bellows, Henry & Co. 


A large oil refinery in the south initiated a 
testing program to assist in promotion of line 
employees to foreman and supervisory posi- 
tions. The program was developed in terms of 
tests, inventories and other information which 
would differentiate good from poor existing 
foremen. The Bernreuter Personality Inven- 
tory was included in the experimental test 
battery because of the frequent reference to
personality factors as a major factor in success 
or failure as a foreman. 

The tests were given during regular training 
programs held for foremen. Since these pro- 
grams normally covered a wide variety of 
topics and activities, the administration of tests 
did not constitute a highly unusual situation 
for the foremen. There were 492 foremen 
who regularly participated in these training 
sessions. Approximately 400 took each of the 
tests or inventories though only about 360 took 
all of the materials. The absenteeism was oc- 
casioned by illness, inability to leave work, out-
of-town activities, and other legitimate reasons.
The principal use to be made of the test results, 
i.e., selection of new foremen, was explained 
and discussed with the participating foremen. 
Some grumbling was noted in the testing ses- 
sions, largely with reference to inclusion of 
factual materials which appeared to have rela- 
tionship to schooling, but the general response 
was quite favorable. 

The 492 foremen were segregated into groups 
of men who supervised the same type of work 
and had the same level of responsibility. The 
men in each group were then ranked in order 
of judged performance by as many of their 
senior supervisors as were considered to know 
the work of the group. From four to twelve 
such rankers were available for ranking the 
men in each group of foremen. These rank- 
ings were converted to position scores to 
equate for differences in the size of groups and 
these scores were averaged to secure a criterion 
score for each of the 492 foremen. As might 


have been expected, there was very good agree- 
ment among the raters on the placement of 
most foremen but very poor agreement on the 
placement of some others. The over-all reli- 
ability of the criterion was estimated at .91, 
this being the median group correlation secured 
by comparing the average position score from 
half the rankers with that of the other half 
and computing the reliability coefficient by use 
of the Spearman-Brown prophecy formula. 
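
The split-half estimate described above can be sketched in Python. This is a minimal illustration and not the author's computation: the function names are hypothetical, and the half-against-half correlation of .835 is an assumed value, chosen only because stepping it up by the Spearman-Brown formula gives approximately the reported .91.

```python
from statistics import mean

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown_double(r_half):
    """Step a half-versus-half correlation up to the full set of rankers."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(half_a, half_b):
    """half_a, half_b: average position score per foreman from each half of the rankers."""
    return spearman_brown_double(pearson_r(half_a, half_b))

# Assumed half-correlation for illustration; not a figure reported in the study.
print(round(spearman_brown_double(0.835), 2))  # 0.91
```

In the study this computation would be carried out separately within each group of foremen, with the median of the group values taken as the over-all estimate.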

A sample of 241 foremen was selected for the 
bulk of the test analysis. This sample was 
representative of the total population with re- 
spect to level of supervision and type of work 
supervised. By criterion score, it included 79 
foremen who were uniformly considered among 
the best in the refinery, 86 who were considered 
among the least desirable, and 76 who were 
considered average or typical. The Bern- 
reuter Personality Inventory was completed by 
191 members of this sample of 241. These in- 
cluded 65 of the best, 66 of the worst, and 60 
average. There were 90 first line foremen, 71 
second line, and 30 third line. 

The Bernreuters were first scored on the keys 
developed from Flanagan’s factor analysis of 
the items (3). These two are described in the 
test manual (1) as F1-C, “a measure of con-
fidence in oneself,” and F2-S, “a measure of
sociability.” The scores on these two factors
correlated -.16 and -.09 respectively with
the criterion. The scores on the two scales 
were spread widely among the foremen, but 
there was little consistency in relationship be- 
tween scores and criterion. Such relationship 
as existed showed that the foremen in the low 
criterion group had the more favorable scores. 

These refinery foremen do differ from the 
normative group. Their scores are lower on 
both F1-C and F2-S, and these differences are 
statistically significant. The critical ratios 
for the mean differences between the refinery 
foremen and the normative group are 4.1 for 
F1-C and 9.0 for F2-S. Thus it appears that 

Table 1

Average Score of High, Middle, and Low Criterion Groups on Keys F1-C (self-confidence) and F2-S (sociability),
Compared with the Norms for Adult Males Given in the Bernreuter Manual (1)

                                F1-C             F2-S
Group               N       Mean    S.D.     Mean    S.D.
High criterion      65     -69.2    62.4    -30.9    46.2
Middle criterion    60     -71.5    83.0    -37.7    50.2
Low criterion       66     -85.6    72.1    -38.2    50.6
Total foremen      191     -75.6    64.4    -35.6    48.8
Author’s norms     914     -53.4              0.2


selection of men with lower scores on these two 
Bernreuter scales would result in men more like 
the foreman population than like the general 
male adult population. However, within the 
foreman population, the differences between 
scores of high and low criterion groups are not 
significant. Thus, these two scales contribute 
little toward the more difficult task of identify- 
ing the best of the foremen. 

Hanawalt and Richardson (4) compared the 
item responses of 90 supervisors with those of
88 non-supervisors and found 23 items which 
were significantly different to the extent of a 
P value of .05 or less. Later, Richardson (6) 
added some less discriminating items and de- 
veloped a “supervisor scale” of 84 items. She 
contrasted 44 supervisors with 45 non-super- 
visors on this scale, finding a significant differ- 
ence between the scores of the two groups. 

The key for this “supervisor scale” was pro- 
cured from Dr. Richardson and used to score 
the Bernreuter personality inventories of the 
191 refinery foremen. The correlation be- 
tween these scores and the criterion was .02. 
Yet again, just as was true for Keys F1-C and 


Table 2

Comparison of Refinery Foremen Scores with Those of Supervisors and
Non-supervisors Used by Dr. H. M. Richardson in Establishing Validity
of an 84-item Supervisor Scale for the Bernreuter

Group                            N     Mean    S.D.
Refinery foremen                191    28.9    12.7
Richardson’s supervisors         44    32.7    16.5
Richardson’s non-supervisors     45    17.4    19.2


N.B. Dr. Bernreuter suggests that lower scores indicate better adjustment. 


F2-S, the entire group of foremen does score in 
the more favorable direction. 

The difference between the mean scores of the
refinery foremen and the non-supervisor group
is significant (C.R. = 3.8). Unfortunately, no
group of refinery employees who are not fore- 
men was available for comparison. 
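
The critical ratio quoted above can be checked with a short Python sketch. It rests on two assumptions not stated in the text: that the two figures reported for each group in Table 2 are the mean and standard deviation, and that the critical ratio is the difference between means divided by the standard error of that difference for independent samples. The function name is hypothetical.

```python
from math import sqrt

def critical_ratio(mean1, sd1, n1, mean2, sd2, n2):
    """Difference between two independent means divided by its standard error."""
    se_diff = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se_diff

# Refinery foremen versus Richardson's non-supervisors (figures from Table 2).
cr = critical_ratio(28.9, 12.7, 191, 17.4, 19.2, 45)
print(round(cr, 1))  # 3.8
```

Under these assumptions the sketch reproduces the reported value of 3.8.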

Since neither the two Bernreuter scales nor 
the “supervisor scale” of Richardson differ- 
entiated the best from the poorest foremen, an 
item-analysis of the papers of the 191 refinery 
foremen was made. The responses of the 65 
best foremen were compared with those of the
66 worst. Use of Chi square techniques re- 
sulted in identification of 13 items which differ- 
entiated the best from the poorest at the five 
per cent level of confidence. Since there are 
125 items in the inventory, this yield is ap- 
proximately seven more than that which might 
have been expected if only chance were operat- 
ing. The papers were scored for these 13 
items and correlated with the criterion scores 
of the 191 foremen, 131 of whom were included 
in the item-analysis. The resulting correlation 
was .48. Cureton (2) has clearly pointed out 
the impossibility of drawing favorable con- 
clusions from a correlation coefficient derived 
from the same population as the item-analysis 
which furnished the data for the key. He only 
alludes to the fact that negative conclusions 
may be pointed up very forcibly. In this 
instance, a correlation of .48 is relatively small 
when the source of the key is considered and 
the potential value of the key may be further 
discounted by reference to its reliability. For 
these 13 items, the reliability for 191 cases was 
.37, odd-even items, estimated by the Spear- 
man-Brown prophecy formula. 
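
The chance expectation mentioned above follows from simple arithmetic, sketched here in Python. The binomial tail is an added illustration and assumes that the 125 items behave as independent tests, which inventory items need not do; the function names are hypothetical.

```python
from math import comb

def expected_chance_hits(n_items, alpha):
    """Number of items expected to reach significance by chance alone."""
    return n_items * alpha

def prob_at_least(k, n_items, alpha):
    """Binomial tail P(X >= k), treating the items as independent tests (an assumption)."""
    return sum(comb(n_items, j) * alpha ** j * (1 - alpha) ** (n_items - j)
               for j in range(k, n_items + 1))

expected = expected_chance_hits(125, 0.05)   # about 6.25 items
excess = 13 - expected                       # about 6.75, i.e. roughly seven
tail = prob_at_least(13, 125, 0.05)          # chance of 13 or more such items
print(expected, excess, tail)
```

The excess of roughly seven items is the figure cited in the text; the tail probability only indicates how unusual a yield of 13 would be if the independence assumption held.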

Though only 13 items differentiated good
from poor foremen with statistical significance, 
these were inspected for coherence with the 
thought that leads might be found for con- 
struction of an amplified personality inventory 
emphasizing these ideas. The results were 
somewhat startling. The items more char- 
acteristic of the best foremen in the refinery 
included the following: Working better when 
praised; considering themselves rather nervous 
persons; feeling self-conscious in the presence 
of superiors; being shy; being embarrassed at 
mistaking a stranger for an acquaintance; and 
minding taking back articles purchased at a 
store.¹ The idea of amplifying such items was
dropped. Top management found it difficult 
to believe that their better supervisors would 
be inclined to mark such items in this way and 
some different approach to the problem is 
indicated. 

On the basis of this study the Bernreuter 
Personality Inventory was not recommended 
for selection of foremen at this refinery. While 
both the author’s F1-C and F2-S keys and 
Richardson’s “supervisor scale” key showed the 
foreman population to be different from non- 
supervisors, they did not discriminate between 
the better and the poorer foremen. Other 
instruments used in the study of these 241 
refinery foremen did discriminate. Several of 
these are shown in Table 3. 

The materials which showed the highest re- 
lationship with the criterion do not particularly 
tap what is customarily referred to as personal- 
ity. Yet, management insists that personality 
factors are highly important in the makeup of a 
good foreman. Indeed, in discussing the cases 
of this study where there was disagreement 
between predicted and actual criterion scores, 
senior supervisors regularly attributed the 
deviation to personality factors.

It is doubtful that additional work, such as 
Richardson’s item-analysis and the item-analy-
sis reported here, or further factor analysis
such as Flanagan’s or Martin’s (5), will make
the Bernreuter more valuable as a selection 
device for foremen and supervisors. First, 
the preference value or “halo” of the items is 
too clearly evident. The respondent can 


¹ The 13 items found significant (P is .05 or less)
were: 7, 102, 14, 20, 24, 15, 65, 30, 111, 70, 63, 97,
and 92.


Table 3

Correlation Between Test and Criterion Scores of Refinery Foremen

Test                                                   N    Correlation
Bennett Mechanical Comprehension Test                 214      .32
Otis Mental Ability Test, Gamma                       216      .34
Richardson, Bellows and Henry Test of
  Supervisory Judgment                                187      .54
Kuder Mechanical plus Computational Interests         193      .27
Bernreuter, F1-C                                      191     -.16
Bernreuter, F2-S                                      191     -.09


N.B. Differences in N are occasioned by absentee- 
ism, with different foremen absent from the various 
test sessions. 


easily determine what he believes to be the 
correct response. Second, the items are too 
“global” in nature. That is, the situations 
involved may never occur within the scope 
of the foreman’s activities. This is not nec- 
essarily bad, but it does demand either con- 
stancy of relationship between foreman and 
non-foreman actions or correlation of a high 
order. 


Summary 


The Bernreuter Personality Inventory was 
included in an experimental test battery given 
to foremen in a large oil refinery. Scores from 
keys F1-C (self-confidence) and F2-S (socia- 
bility) did not correlate significantly with a 
criterion derived from rankings of senior super- 
visors. A special “supervisor scale” developed 
by Richardson also failed to correlate with the 
criterion. The foremen did appear to be a 
different population than the normative group 
of Bernreuter and the non-supervisory group of 
Richardson, although the keys did not differ- 
entiate good from poor foremen. An item- 
analysis on the 191 foremen included in the 
study yielded 13 items which discriminated the 
best from the poorest foremen, seven more than 
would have been expected by chance. The 
reliability of a scale composed of these 13 
items was too low, however, to warrant further 
work. 


Received January 26, 1951. 

Validities of the Forced-Choice and Questionnaire Methods
of Personality Measurement * 


Leonard V. Gordon 


University of New Mexico** 


The validity of the questionnaire method in 
personality measurement has received much 
attention in the literature ever since the incep- 
tion of the technique. That this is an impor- 
tant matter is attested to by the wide use that 
this method has enjoyed. Despite this wide 
use, psychologists generally have not been 
satisfied with the performance of the personal- 
ity questionnaire. 

Kornhauser (10) obtained the opinion of a 
representative sample of psychologists as to 
how satisfactory they considered personality 
inventories of the Bernreuter, Bell, or Humm- 
Wadsworth type to be. Eighty-five per cent 
of the respondents rated the questionnaire as 
being “doubtfully satisfactory” or worse. 

Even though Ellis (3), in his review of valid-
ity studies, was generous both in accepting re- 
ported validities at face value and in inferring 
validities from data reported, he was forced to 
conclude that the performance of the personal- 
ity questionnaire generally has proven disap- 
pointing. However, Ellis and Conrad (4) 
report better performance for this method in 
military application. 

A primary reason for the low validity of the 
personality questionnaire is the motivation of 
a majority of respondents to mark socially ac-
ceptable alternatives to items, rather than those 
which they believe apply to themselves. 
Horst originally suggested a method for at- 
tacking this problem (20) which was developed 
by Wherry, both for personality questionnaires 
and rating scales, into what has come to be 
known as the forced-choice method (17). 

In essence, the forced-choice method in- 
volves the presentation of pairs of items that 
have been equated for preference value but 


* The writer is indebted to Professor Robert J.
Wherry for many helpful suggestions in the present
study. A major portion of this study was performed in
partial fulfillment of the requirements for the degree of
Doctor of Philosophy at Ohio State University.
** Visiting, 1951-52.


which differentially discriminate on a criter-
ion.¹ A major assumption underlying this
technique in personality measurement is that
if two items are equally derogatory from the
point of view of the social group, individuals to
whom one of the items is more applicable will
tend to perceive that item as being the less
derogatory. Thus, if an individual, who is
motivated to make socially acceptable re-
sponses, is forced to select one of the items as
being least like himself, he will select the item
that he perceives to be most derogatory, which
will tend to be the item that is less like himself.
It is assumed that the converse holds when he
is forced to select one of a pair of complimen- 
tary items as being most like himself. Variants 
of the method use sets of 3, 4, or 5 items and 
may force both a most and least choice for 
each set. 

Aside from clinical evidence, support for 
the operation of this projective principle can 
be found in studies by Finger (5), Gordon (6), 
Travers (16), and Wallen (18). These studies 
show that individuals, to whom particular be- 
haviors apply, tend to perceive these be- 
haviors as being more prevalent in the group 
than do individuals to whom these behaviors 
do not apply. For socially undesirable be- 
havior, that which is perceived as being more 
characteristic of the group probably will be 


1 The term forced-choice frequently has been used to 
refer to practically any measurement situation in which 
the individual is required to make some choice from a 
set of stimuli. This writer is restricting the term forced- 
choice to measurement situations in which the stimuli 
have been matched for equality of preference value and 
also have differential discriminating ability in a speci- 
fied situation. Such an operational definition serves to 
distinguish the technique used here from other multiple 
choice applications, as for example the Kuder (11) Test 
in which stimuli were matched for differential discrimi- 
nating ability but not for equality of preference value or 
the Jurgensen (9) Test in which the stimuli were 
matched for equality of preference value but not for 
differential discriminating ability. 
taken to be the more acceptable to admit about
oneself.

In the forced-choice situation, where alterna- 
tive traits apply equally to an individual, a 
chance distribution of responses should leave 
the profile unaffected. Where the alternative 
traits do not apply equally, but where the 
alternative items are perceived as being equally 
derogatory or complimentary, some validity 
would be expected since guessed subliminal dis- 
criminations tend to fall in the direction of the 
true measure. 

The forced-choice method has been reported 
as being effective in reducing rater bias in the 
military service and in industry (14, 15). The 
present study was undertaken to compare the 
validities of the forced-choice and question- 
naire methods in self-report personality meas- 
urement. 


Development of the Initial Tests 


In order to obtain a meaningful comparison 
between the two methods, it was necessary to 
construct a personality questionnaire and a 
forced-choice test, both of the same factorial 
structure and containing, as far as possible, 
the same item content. 

The factorial literature in the field of per- 
sonality was reviewed and Cattell’s (2) factors 
E, G, H, A, and K were tentatively selected as 
personality dimensions to be used in the pres- 
ent study. Since some of the factors were 
significantly intercorrelated, items with high 
loadings on one of the factors and low loadings 
on the others were selected as guide items. 
Since none of these items represented the 
hypersensitivity content characteristically 
found in personality inventories, items with 
high loadings on Mosier’s (13) Hypersensitiv- 
ity factor were also included. Trait names 
from the Allport and Odbert (1) list that were 
similar in content to the above items were 
used to develop new items. Three hundred 
of the items finally developed were divided 
into two forms of 150 items each.² One form

² Since it was believed that the location of an item
might affect its preference value, three subforms were
developed within each form, with the items appearing
in different positions on each. The preference value
used was the average for the three positions. The
was administered to 390 students and the other
form to 282 students. 

The students were instructed to consider 
each item and decide the degree to which it 
applied to them, using the following key: 1. 
Always or almost always applies to me; 2. 
Usually applies to me; 3. Applies to me about 
as often as not; 4. Occasionally applies to me; 
and 5. Never or almost never applies to me.

A separate factor analysis was performed on 
the 150 items in each form, using the Wherry- 
Gaylord (19) iterative method. Six factors 
emerged in each analysis, having intercorrela- 
tions ranging from .05 to .35 in one form and 
—.17 to .42 in the other. The axes were ro- 
tated to orthogonality, and the factors rotated 
for simple structure. When the factors in the 
two forms were identified and compared, for 
every factor in one form there was a counter- 
part in the other form with highly similar con- 
tent. Thus, amalgamation of items from 
corresponding factors appeared to be justified. 
The six factors were: I, Ascendency; II, Gen-
erosity; III, Hypersensitivity; IV, Refine-
ment; V, Responsibility; and VI, Sociability.
Two identifying items from each factor are 
presented below: 


  I. Can be influenced rather easily
     Would rather follow than lead
 II. Ready to take but not to give
     Concerned about the well being of others
III. Calm and easygoing in manner
     Easily upset when things go wrong
 IV. Well mannered in social situations
     Tends to be rather ill bred
  V. Sees a job through despite difficulties
     Doesn’t take responsibilities seriously
 VI. Likes to come in contact with new people
     Would rather not attend social gatherings


The preference value of each item was ob-
tained by tallying the frequencies for each re-
sponse category, weighting the response cate-
gories from 1 to 5 (1 for “Never . . . ,” 2 for
“Occasionally . . . ,” etc.), and computing the
mean. Items from each factor were paired
with items from each of the other factors for
equality of preference value. The preference
values of the Refinement items were so extreme
effect of position on the preference value and its impli-
cations for forced-choice test construction are discussed
elsewhere (8).
that they could not be matched with items from
other factors, necessitating the elimination of 
this factor for consideration for the forced- 
choice test. Generosity had too few items to 
permit its inclusion. 
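
The preference-value computation and the matching of items can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the tallies and item names are invented, and taking the nearest preference value is only one plausible reading of pairing items "for equality of preference value."

```python
# Response key on the form: 1 = "Always or almost always" ... 5 = "Never or almost never".
# The weights run the other way (1 for "Never", 5 for "Always"), so weight = 6 - code.

def preference_value(frequencies):
    """frequencies: dict mapping response code (1-5) to the number of students choosing it."""
    total = sum(frequencies.values())
    weighted = sum((6 - code) * count for code, count in frequencies.items())
    return weighted / total

def closest_match(target, candidates):
    """Name of the candidate item whose preference value lies nearest the target."""
    return min(candidates, key=lambda name: abs(candidates[name] - target))

# Hypothetical tallies for one item answered by 100 students.
tallies = {1: 40, 2: 30, 3: 20, 4: 8, 5: 2}
pv = preference_value(tallies)

# Hypothetical preference values for items drawn from another factor.
other_factor = {"item_a": 2.10, "item_b": 4.00, "item_c": 3.50}
partner = closest_match(pv, other_factor)
print(pv, partner)
```

A factor whose items all have extreme preference values, as was true of Refinement, finds no close partner under any such matching rule.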

Pairs of low preference items from two factors 
were grouped with pairs of high preference items 
from the remaining factors to form tetrads. 
In all, 24 tetrads were developed to form the 
forced-choice test. A sample tetrad is pre- 
sented below: 


A good mixer socially
Gives in readily to other people’s wishes
Thorough in any work undertaken
Worries about possible misfortune

The respondent is instructed to indicate 
which item is most like and which item is least
like him for each tetrad. For the individual 
who is out to “beat” the test the apparent 
freedom to accept or reject any item is decep- 
tive. Where the individual can discriminate 
high preference items from low preference 
items, he will select a high preference item as 
most and a low preference item as least. This 
automatically ranks the omitted high prefer-
ence item as less and the omitted low prefer-
ence item as more. This forced ranking may
be capitalized on in the scoring key. 

Forced-choice format still allows the consci- 
entious respondent the opportunity to mark 
low preference items as being most like himself, 
and high preference items as being least like 
himself. This is the type of response that 
contributes most to the validity of the con- 
ventional questionnaire. 

Twenty-three items having the highest load-
ings on Ascendency, Hypersensitivity, Re-
sponsibility and Sociability were selected for
the questionnaire. The inclusion of 18 Re-
finement items appeared justified since these
were included in the original presentation. It
was believed that inclusion of these items would
give the questionnaire greater variety and thus
have a salutary effect. The questionnaire
contained 110 items, alternated by factor, and
employed the same five point key as the
original.


Validation of the Initial Tests


The forced-choice test and questionnaire
were administered at a college dormitory for
women, followed by a nominating session.
The resulting data were used to revise the
questionnaire and to develop empirical keys
for the forced-choice test.

The forced-choice test was answered by 104
members of 6 corridors and the questionnaire
by 104 members of 7 corridors. After the test-
ing, each subject nominated 3 girls from her
corridor as best fitting each of the following A
and B descriptions for Ascendency:


A. Tends to take the lead in group discussion
   A strong positive influence on others
   Willing to defend her own opinions
   A self-assured person

B. Would rather follow than lead
   Tends to be readily influenced by others
   Would rather agree than argue
   Gives in quite easily to others


The procedure was repeated independently
for Responsibility, Hypersensitivity, Sociabil-
ity and Refinement. Further description of
the rating scales and results of the nominations
may be found elsewhere (7).


Table 1

Test Reliabilities (S.B.) and Validities and Criteria Reliabilities (S.B.)
for the Initial Validation Study

                       Validity         Test Reliability   Criterion Reliability
                   Forced-  Question-  Forced-  Question-  Forced-  Question-
Scale              Choice   naire      Choice   naire      Choice   naire
Ascendency          .332     .457       .719     .809       .899     .915
Hypersensitivity    .294                .803     .933       .849     .712
Responsibility      .238     .464       .821     .852       .864     .881
Sociability         .438     .153       .914     .827       .887     .867


Criterion scores were determined by the
number of A and B nominations, weighted for
the number of girls in the corridor. Split-half

Table 2


Intercorrelations among Scales and Criteria for the Final Validation 
Group with Reliabilities (S.B.) in the Diagonals 


* Forced-Choice 


Questionnaire 


H R H 


R 


829 
199 - .750 .790 
145° .036 


> 


825 


A 
H .183 .867 
749 R 345  .467 857 
—-08 742 S 058 —-077 912 


reliabilities, obtained by dividing the rating 
forms into random halves, were corrected by 
the Spearman-Brown formula and are pre- 
sented in Table 1. 

The questionnaire was scored with the
original pass-fail keys, in which the 5 self-rating
degrees had been dichotomized for each item
so as to give the closest approximation to a 50
per cent cut. The score on each scale was the
number of items passed. In the forced-choice
test, a score of plus one was given for each high
preference item marked most and for each low
preference item marked least, and a score of
minus one for each high preference item marked
least and each low preference item marked
most. The scale score was the algebraic sum
for that scale.
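
The forced-choice scoring rule just described can be sketched in Python. The data structures and function names are hypothetical, and the assignment of the sample tetrad's items to scales and preference levels is inferred from their wording; it is not given in the text.

```python
from dataclasses import dataclass

@dataclass
class Item:
    scale: str             # scale to which the item belongs
    high_preference: bool  # True for a high preference item, False for a low one

def score_tetrad(items, most, least):
    """items: the four Items of one tetrad; most, least: indexes of the marked items."""
    scores = {}
    for index, marked_most in ((most, True), (least, False)):
        item = items[index]
        # +1 for high preference marked most or low preference marked least, else -1.
        point = 1 if item.high_preference == marked_most else -1
        scores[item.scale] = scores.get(item.scale, 0) + point
    return scores

def score_test(tetrads, responses):
    """Algebraic sum of the tetrad points for each scale."""
    totals = {}
    for items, (most, least) in zip(tetrads, responses):
        for scale, points in score_tetrad(items, most, least).items():
            totals[scale] = totals.get(scale, 0) + points
    return totals

# Sample tetrad; the scale and preference labels are assumptions.
sample_tetrad = [
    Item("Sociability", True),        # A good mixer socially
    Item("Ascendency", False),        # Gives in readily to other people's wishes
    Item("Responsibility", True),     # Thorough in any work undertaken
    Item("Hypersensitivity", False),  # Worries about possible misfortune
]
print(score_test([sample_tetrad], [(0, 3)]))
print(score_test([sample_tetrad], [(3, 0)]))
```

The first response is that of a subject choosing the socially acceptable alternatives; the second is the candid response of a subject who marks a low preference item as most like himself, and it lowers two scale scores at once.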

Corrected split-half reliabilities and validi-
ties for the questionnaire and forced-choice
scales are presented in Table 1. No striking
differences appear between the scales that
were developed by the method of internal
consistency. However, the validities of all
four of the forced-choice scales but only two
of the questionnaire scales are significantly
greater than zero at the 2 per cent level of
confidence. The only significant scale differ-
ence is for Sociability, in which the forced-
choice method is superior.


Table 3
Test Validities for the Final Validation Groups


The Revisions and their Validations 


On the basis of an item-criterion analysis, 
a revised questionnaire was developed contain- 
ing the 10 items which best discriminated the 
high from the low criterion individuals in each 
scale. A similar item-criterion analysis for 
the forced-choice test resulted in the develop- 
ment of an empirical scoring key for each scale. 

Both the revised questionnaire and the 
forced-choice test were administered to 63 
females from 5 dormitory units and 55 males 
from 5 dormitory units at a small college. 
The testing was followed by a nominating 
session which was procedurally the same as for 
the initial validation group. 

Criteria intercorrelations and corrected reli- 
abilities for these groups are presented in 
Table 2. The reliabilities indicate that the 
subjects were in substantial agreement in 
their nominations. The range of criteria 
intercorrelations is about the same as that 
obtained by Cattell (2). 

Intercorrelations among scales and scale 
reliabilities are also presented in Table 2. The 


Male 


Question- 
Scale Choice naire 


p Forced- 
Choice 


Question- 


Ascendency 499 
Hypersensitivity 457 
Responsibility 569 306 
Sociability 331 


729 523 1% 
1% 142 1% 
— 611 385 2% 


410 | 4 
A | A H R S 
A 817 730 
H 021.785 
R 133 .062 

Table 4

Validities of the Scales Corrected for Attenuation

                        Male                          Female
Scale       Forced-Choice  Questionnaire   Forced-Choice  Questionnaire


Ascendency 598 
Hypersensitivity 726 
Responsibility 663 
Sociability 


.259 546 
848 625 
366 588 -182 
403 -706 468 


corrected reliabilities of the forced-choice 
scales are of the same magnitude as those pre- 
viously obtained. The lower corrected reli- 
abilities of the questionnaire scales can be 
attributed to their reduction in length. Apply-
ing the Spearman-Brown formula to predict
the reliabilities of scales of the original length
results in prophesied reliabilities of the same
magnitude as those originally obtained.
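
The prediction referred to here presumably uses the general Spearman-Brown formula for a test lengthened k times. The worked figures below are a sketch under two assumptions: that each original questionnaire scale held 23 items against 10 in the revision (so that k = 2.3), and a purely hypothetical reliability of .70 for a 10-item scale.

```latex
r_{kk} = \frac{k\,r}{1 + (k-1)\,r}, \qquad k = \frac{23}{10} = 2.3

% Hypothetical example: a 10-item scale with reliability r = .70
r_{kk} = \frac{2.3 \times .70}{1 + 1.3 \times .70} = \frac{1.61}{1.91} \approx .84
```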

The validities of the forced-choice and 
questionnaire scales are presented in Table 3. 
Composite probabilities for both groups indi- 
cate the superiority of the forced-choice method 
over the questionnaire method for all scales 
beyond the 5 per cent level of confidence, and 
at higher levels for individual scales.* 

Discussion 

In the present study the forced-choice 
method has been found to be more valid than 
the questionnaire method in the measurement 
of four personality traits. Investigation of
the assumption that the superiority of the 
forced-choice test stems from its methodo- 
logical approach is in order. 

In the initial validation, sizable correlations 
occurred between the Hypersensitivity and 
Responsibility criteria for the forced-choice 
sample but not for the questionnaire sample. 
Selection of items for the empirical forced- 
choice keys took this criteria correlation into 
account. This resulted in an advantage for 
these forced-choice scales in the final validation, 
since these criteria were also substantially 
correlated for this group. To eliminate this 
accidental advantage, the Responsibility and 
Hypersensitivity scales were rescored with the 
* The test for the significance of the differences in 


validities follows McNemar (12) for the case where an 
element of correlation is present. 


original forced-choice keys, in effect giving a 
theoretical advantage to the questionnaire. 
Although the correlation dropped from .75 to 
.15, these forced-choice scales maintained their 
superiority over those of the questionnaire. 
Thus, the superiority of the forced-choice test 
cannot be ascribed to this chance advantage 
for the two scales. 

The validities, corrected for attenuation
(Table 4), indicate that the differences in
validity cannot be ascribed to differences in
reliability or test length.⁴ In fact, the upper
limits of the validities of the questionnaire
scales run lower than the obtained validities of
the forced-choice scales.
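
The correction for attenuation can be sketched as below. The text does not say whether the correction was applied for unreliability in the test, the criterion, or both; the sketch assumes both, which is the form that yields an upper limit. The illustrative figures are the forced-choice Sociability values of the initial validation (Table 1), read with decimal points restored, and are not the entries of Table 4. The function name is hypothetical.

```python
from math import sqrt

def correct_for_attenuation(validity, test_reliability, criterion_reliability):
    """Estimated validity if both test and criterion were perfectly reliable."""
    return validity / sqrt(test_reliability * criterion_reliability)

# Forced-choice Sociability, initial validation: validity .438,
# test reliability .914, criterion reliability .887.
corrected = correct_for_attenuation(0.438, 0.914, 0.887)
print(round(corrected, 3))  # 0.486
```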

Although the forced-choice test and ques- 
tionnaire appear to be different approaches to 
personality measurement, both tests used as 
multiple predictors do not have significantly 
higher validity than the forced-choice test alone.
For none of the scales did the multiple correla- 
tion of the battery give an increase of as much 
as .01 over the forced-choice validity alone. 
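
Why a second predictor may add almost nothing can be seen from the formula for the multiple correlation of a criterion with two predictors. The sketch below uses hypothetical correlations only; none of the values is taken from the study, and the function name is an assumption.

```python
from math import sqrt

def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.
    r1, r2: validities of the predictors; r12: correlation between the predictors."""
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return sqrt(r_squared)

# Hypothetical values: forced-choice validity .50, questionnaire validity .30,
# and a correlation of .60 between the two tests.
gain_none = multiple_r(0.50, 0.30, 0.60) - 0.50
# A slightly more valid questionnaire (.35) still adds less than .01.
gain_small = multiple_r(0.50, 0.35, 0.60) - 0.50
print(gain_none, gain_small)
```

When the weaker test's validity is no greater than the product of the stronger test's validity and the correlation between the two tests, the battery predicts no better than the stronger test alone.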

The forced-choice test, like the question- 
naire, is most valid for low scores. Indi- 
viduals can be selected with greater confidence 
as being at the undesirable end of the trait 
continuum than at the desirable end. Appar- 
ently the forced-choice test has greater success 
than the questionnaire in producing low scores 
for low criterion individuals. For one thing, 
the usual approach for “beating” a test is not 
there. The individual can no longer say 
“yes” to all good items and “no” to all bad 
items. He is continually forced to make 
rankings within each tetrad. He may un- 
knowingly make a low score by simple omis- 
sion, that is, by consistent neglect of items in a 


⁴ In correcting for attenuation, separate criteria reli-
abilities for the male and female groups were used.
particular scale. It should be indicated that
the very factors that increase the validity of
the lower end of the scale result in increased 
validity at the upper end by preventing low 
criterion individuals from making high scale 
scores. 

There are a number of problems in forced- 
choice test construction, the discussion of 
which is beyond the scope of the present paper. 
Considerable systematic research will be 
necessary for the development of techniques 
for constructing optimal forced-choice tests. 
Since the forced-choice method has proved it-
self in several fields of application, further re-
search toward the development of a forced-
choice theory should be of great value.


Summary 


1. A personality questionnaire and forced- 
choice personality test, both of the same fac- 
torial structure and containing much the same 
item content, were constructed by the method 
of internal consistency. The tests consisted 
of the following factors: Ascendency, Hyper- 
sensitivity, Responsibility, and Sociability. 

2. The initial validation on a female sample 
showed no striking differences between the 
methods. However, the validities of all four 
of the forced-choice scales but only two of the 
questionnaire scales were significantly greater 
than zero. 

3. Item-criteria analyses resulted in the de- 
velopment of a revised questionnaire and em- 
pirical scoring keys for the forced-choice test. 
The revised tests were administered to male
and female criteria groups, nominations being 
used as criteria. For all four scales, the forced- 
choice method was found to be more valid than 
the questionnaire method. 

4. Multiple correlations indicate that the 
questionnaire adds nothing towards the pre- 
diction of the criteria when placed in a battery 
with the forced-choice test. 


Received February 1, 1951.


Cross-Validation of a Forced-Choice Personality Inventory


James J. Kirkpatrick


The need for a personality test designed
specifically for personnel selection has long
been recognized. There is general agreement
that an individual’s personality attributes are
important determinants of his vocational or
scholastic success. It is probable that the
effectiveness of many of our selection programs
could be considerably improved if we had a
satisfactory technique for personality evalua-
tion. Unfortunately, owing to the complex
and intangible nature of personality itself, only
modest progress has been made in the develop-
ment of personality tests which are actually
predictive of later performance. Projective
tests, in their present stage of development, in
general have not yet manifested satisfactory
validity in the selection of personnel. The
usual type of questionnaire is also of limited
value when dealing with applicants. There-
fore, a personality test that could be validly
used for personnel selection would be an in-
valuable contribution to many testing pro-
grams. It is the purpose of this paper to
evaluate one attempt to devise such a test.


The Classification Inventory


In the December, 1944 issue of this journal,
Clifford E. Jurgensen (5) reported on a per-
sonality test that he had developed which he
called the Classification Inventory. In addi-
tion to being one of the few tests of personality
designed primarily for personnel selection, it is
rather unique in utilizing a modified forced-
choice technique. One other such test, called
the Personal Inventory, was developed for the
Navy, and Shipley, Gray, and Newbert (15)
report that this test discriminated significantly
between normal and psychiatric groups. Fur-
thermore, the success of the forced-choice
method in the related area of personnel rating
(16) indicates that this approach may be a
partial answer to the difficulties which are
inherent in the complex problem of personality
assessment in the selection situation.

In order to avoid certain difficulties en-
countered when using the conventional per-
sonality test of the questionnaire type, Jur-
gensen (5) set up the following commendable
requirements for his Classification Inventory:


1. The applicant must not be able to predict
the “right” answer in order to gain favorable
consideration.

2. It should be possible to give an answer to
the items. Responses should not be forced
into a “yes” or “no” dichotomy, and there
should be no ambiguous “?” category.

3. The test should be analyzed and validated 
on particular jobs rather than on traits, which 
lack precise meaning. 

4. The validation sample should be com- 
parable to the population for which the test is 
to be used eventually as a selection instrument. 


The Classification Inventory, in its present 
1947 edition, consists of 288 items, of which 216 
are in triad form and 72 are in paired form. In
each triad the examinee is required to select 
his first and his third choice, and in each pair 
he indicates his preference. Testing time is 
approximately forty-five minutes, and the test 
is suitable for group administration. 

As indicated in the last two requirements 
mentioned above, Jurgensen (5) recommends 
selecting items on the basis of an item analysis 
and validating the test for specific jobs on 
representative samples. However, it is in 
this connection that Jurgensen’s actual pro- 
cedure leaves much to be desired. While he 
reports several validity coefficients ranging 
from .67 to .81, these values are based on two 
small samples, one consisting of 40 salesmen 
and the other composed of 30 graduate stu-
dents. In addition, these validity coefficients
were not obtained from hold-out groups, since
the subjects in the corresponding item analysis
groups are the same 40 salesmen and 30 gradu-
ate students. Unfortunately, the failure to
cross-validate is a deficiency found all too fre-
quently in the psychological literature, e.g.,
(4, 8, 11, 12, 17, 18).1


1 It is not implied that all of these investigators were 
unaware of this problem.


Other Validity Studies of the
Classification Inventory


Subsequent research dealing with the Classi- 
fication Inventory has yielded rather contra- 
dictory results. In an investigation of the test 
as a predictor of college achievement, Adams 
(1) obtained validity coefficients ranging from 
—.01 to .06. The total number of students 
used in this study was 585. With an industrial 
sample of 176 subjects, Pred (14) also obtained 
negative results in an investigation of the 
validity of the test in distinguishing between 
“good” and “poor” industrial supervisors. 
Both of these studies utilized cross-validation 
procedures. 

More favorable conclusions regarding the 
Classification Inventory were reached by 
Knauft (9) in an investigation designed to 
predict managerial success. Although he re- 
ports a correlation coefficient of .64 between 
test scores and a composite criterion of job 
success for 79 subjects, he warns that this value 
must be interpreted with considerable caution 
since 54 per cent of the sample on which this 
coefficient is based was used in the item analy-
sis. However, Knauft did carry out a cross-
validation study on 32 cases, obtaining a t
ratio significant at the five per cent confidence
level, based on the difference between the mean 
test scores for the 16 highest on the criterion 
and the 16 lowest. 


The Present Study


This study is concerned with the validity of 
the Classification Inventory as a predictor of 
college achievement. In addition, an attempt 
is made to throw some light on the question of 
cross-validation.

The test was administered in group situa-
tions to 261 male students enrolled in intro-
ductory and advanced psychology courses at
the University of Tennessee. The nature of 
the Classification Inventory was explained, and 
the subjects were informed as to their role in 
this experiment designed to evaluate the test. 
The subjects were encouraged to fill in the 
inventory as accurately as possible. 

Rather than developing scoring keys on the 
basis of assumed personality traits, Jurgensen 
(5) recommends the use of a criterion of job 
success, since, from an operational standpoint,
the objective is to differentiate among persons
exemplifying varying degrees of job profi- 
ciency. Therefore, in this study, the criterion 
of job success which seemed appropriate was 
grade point average, since this offers an ob- 
jective index of success in an important aspect 
of student life. 

For selecting valid items and assigning item 
weights, a simplified level of significance 
method of item analysis was employed. This 
technique, which is based on the chi-square 
test and is set up in table form, was developed 
by Cureton and the writer (2). A distinctive 
feature of this procedure is that the necessity 
for converting frequencies of response to pro- 
portions is eliminated. Once the frequencies 
of response are tabulated, the sums of these 
frequencies for the upper and lower groups com- 
puted, and the differences between the sums 
for the upper and lower groups ascertained, it 
is only necessary to refer to the table in order to 
arrive at the validities of the items directly in 
terms of significance level. 
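
The tabled procedure itself is not reproduced in this paper. As a rough stand-in, the following sketch computes an ordinary 2 x 2 chi-square (one degree of freedom, no continuity correction) directly from the upper-group and lower-group response frequencies for one item. The function name, the default group sizes of 50, and the example frequencies are assumptions for illustration and are not taken from the Cureton and Kirkpatrick table.

```python
from math import erfc, sqrt


def item_chi_square(upper_yes, lower_yes, n_upper=50, n_lower=50):
    """Return (chi-square, p) for one item from raw response frequencies.

    upper_yes / lower_yes: number in each criterion group choosing the response.
    """
    a, b = upper_yes, n_upper - upper_yes
    c, d = lower_yes, n_lower - lower_yes
    n = n_upper + n_lower
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Everyone (or no one) chose the response: the item cannot discriminate.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 df, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


# Hypothetical item: chosen by 30 of the upper 50 and 20 of the lower 50.
chi2, p = item_chi_square(30, 20)
print(round(chi2, 2), round(p, 3))
```

Working from frequencies rather than proportions mirrors the convenience the tabled method is said to offer, although a lookup table avoids even this computation.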

For the purpose of developing scoring keys 
for the Classification Inventory, an item analy- 
sis was performed on a sample of 179 subjects 
selected randomly from the total group of 261 
male students. The remaining 82 subjects 
were used in the validation of the test. Of the 
179 subjects in the item analysis sample, the 
top 50 on the criterion were selected as the 
upper group, and the lowest 50 were selected 
for the lower group. Thus, the upper and 
lower groups each represented approximately 
28 per cent of the total item analysis sample. 

Jurgensen (6) recommends selecting items 
which are found to discriminate between upper 
and lower groups at a level of significance of 
.10, assigning increasing weights for individual 
items in proportion to the significance with 
which they function. While such a differ- 
ential weighting system is generally consid- 
ered an unnecessary refinement, this procedure 
was followed for the first scoring key by as- 
signing a weight of one to those items which, 
on the basis of the item analysis, discriminated 
at the .10 level of significance up to the .02 
level, a weight of two for those items from .02 
up to .002, and a weight of three for those items 
at .002 and above. Both positive and negative
weights were used, a positive weight being as-
signed when the upper group was found to have
the larger frequency of response to a given item,
while a negative weight was assigned when this 
situation was reversed. This first key is 
termed the weighted key. A second scoring 
key was derived without differentially weight- 
ing the items, and this will be referred to as 
the unweighted key. 
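
The weighting rule just described can be sketched as a small function. The names below are hypothetical, and the handling of values falling exactly on a boundary (.02 and .002 are assigned the larger weight) is one reading of the description rather than a documented detail.

```python
def item_weight(p_value, upper_freq, lower_freq):
    """Weighted key: 0 if the item fails the .10 level, otherwise 1, 2, or 3
    by significance band, signed by which criterion group chose it more often."""
    if p_value > 0.10:
        return 0
    if p_value <= 0.002:
        size = 3
    elif p_value <= 0.02:
        size = 2
    else:
        size = 1
    return size if upper_freq > lower_freq else -size


def unweighted_item_weight(p_value, upper_freq, lower_freq):
    """Unweighted key: same item selection, but every retained item counts 1."""
    w = item_weight(p_value, upper_freq, lower_freq)
    return (w > 0) - (w < 0)


print(item_weight(0.05, 30, 20), item_weight(0.01, 18, 33))
```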

The validation sample was composed of the 
hold-out group of 82 male subjects. Their 
tests were scored by both of the keys derived 
from the item analysis. A constant of 100 
was added to each score for the purpose of 
avoiding negative values. The relationship 
between test scores and the criterion of grade 
point average was then estimated by comput- 
ing Pearson correlation coefficients. For the 
weighted key, a validity coefficient of .16 was 
obtained, while the unweighted key yielded a 
validity coefficient of .12. From Guilford’s 
Table D (3, p. 324), it can be concluded that 
neither of these correlation coefficients is sig- 
nificantly greater than zero at the five per cent 
level of confidence. 
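
The conclusion drawn from the table can be rechecked with the usual t test for a correlation coefficient. This is a sketch under stated assumptions: the critical values are standard tabled values of t for 80 degrees of freedom (1.990 two-tailed, 1.664 one-tailed at the five per cent level), and the function names are illustrative. It is not a reproduction of Guilford's Table D.

```python
from math import sqrt

T_CRIT_TWO_TAILED_DF80 = 1.990  # standard tabled value, .05 level, 80 df
T_CRIT_ONE_TAILED_DF80 = 1.664  # standard tabled value, .05 level, 80 df


def t_for_r(r, n):
    """t statistic for testing a Pearson r against zero with n cases."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def critical_r(t_crit, n):
    """Smallest r that reaches the given critical t with n cases."""
    return t_crit / sqrt(t_crit ** 2 + (n - 2))


for label, r in (("weighted", 0.16), ("unweighted", 0.12)):
    t = t_for_r(r, 82)
    print(label, round(t, 2), t > T_CRIT_ONE_TAILED_DF80)

print(round(critical_r(T_CRIT_TWO_TAILED_DF80, 82), 3))
```

On either a one-tailed or a two-tailed reading, both obtained coefficients fall short of the value required with 82 cases.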

Therefore, the results of this validity study 
of the Classification Inventory are essentially 
negative. At least for the population for which 
this student sample is representative, the test
has not been demonstrated to possess satis- 
factory validity as a predictor of scholastic 
achievement. However, it is possible that 
the test may be of value in some situations, 
while being invalid in others. In the study 
mentioned previously, Knauft (9) does present 
some evidence for its validity, based on a hold-
out group of 32 cases.

A tentative explanation of the negative re- 
sults obtained with the Classification Inventory 
as a predictor of academic standing is suggested
by a consideration of the factors that go into 
grade point average. That intellectual factors 
are of much import is indicated by the relatively 
high relationship often found between intelli- 
gence test scores and college grades. On the 
other hand, the influence of personality factors 
on grades has not been clearly established. It 
may well be that the personality attributes 
measured by this test are relatively insignifi- 
cant, as compared with intellectual factors, in 
accounting for the variance in college grades. 
To the extent that this is true, these personal-
ity trends tend to be obscured upon cross-
validation. Therefore, the low validities found
in this study do not indicate that the test may
not be of value in other situations and in com- 
parison with other types of criteria, nor is the 
value of item analysis questioned by these re- 
sults. However, the evidence presented in 
this study points to the fact that if the validity 
of a test is not determined on a hold-out group, 
the results are likely to be misleading. Valid- 
ity cannot be created by item analysis. How- 
ever, the impression of validity may be created 
by failing to follow cross-validation procedures. 

When hold-out groups are not used, validity 
coefficients will be spuriously high. Just 
how much error is involved is a moot question. 
A secondary purpose of the present study is to 
present empirical evidence to illustrate the 
extent to which the failure to cross-validate 
introduces error into a validation study. In 
line with this objective, the weighted key, 
which yielded a validity coefficient of .16 on 
the hold-out group, was used in scoring the 
tests for the 100 subjects in the upper and lower 
item analysis groups. In other words, scores 
were obtained on those individuals whose an- 
swer sheets were previously used for selecting 
and weighting items in the development of that 
scoring key. When these scores were corre- 
lated against the grade point average criterion, 
a tremendous increase in the validity coeffi- 
cient resulted: i.e., from .16 to .76.2 Since this 
is the sort of procedure that Jurgensen (5) 
actually followed, the high correlations that he 
reports (from .67 to .81) become extremely 
suspect; especially when it is noted that his re- 
sults are based on samples of only 40 and 30 
subjects.3
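
The inflation described above can be imitated with purely random data. In the sketch below every item response and every criterion score is generated at random, so the "test" has no true validity; items are nevertheless keyed on the upper and lower 50 of a 179-case analysis sample and the key is then scored both on those same 100 cases and on an 82-case hold-out group. The sample sizes follow the present study, but the selection rule (a frequency difference of at least 8), the unweighted scoring, and all names are simplifying assumptions, and the simulated figures are not the study's data.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate(seed=0, n_total=261, n_analysis=179, n_tail=50,
             n_items=288, min_diff=8):
    rng = random.Random(seed)
    # Each person: (random criterion score, random 0/1 item responses).
    people = [
        (rng.gauss(0, 1), [rng.randint(0, 1) for _ in range(n_items)])
        for _ in range(n_total)
    ]
    rng.shuffle(people)
    analysis, holdout = people[:n_analysis], people[n_analysis:]

    analysis.sort(key=lambda p: p[0])
    lower, upper = analysis[:n_tail], analysis[-n_tail:]

    # Key an item +1 or -1 when the two groups differ by min_diff or more.
    key = {}
    for i in range(n_items):
        diff = sum(p[1][i] for p in upper) - sum(p[1][i] for p in lower)
        if abs(diff) >= min_diff:
            key[i] = 1 if diff > 0 else -1

    def score(person):
        return sum(w * person[1][i] for i, w in key.items())

    def validity(group):
        return pearson_r([score(p) for p in group], [p[0] for p in group])

    # (validity on the item analysis cases, validity on the hold-out cases)
    return validity(lower + upper), validity(holdout)


same_sample_r, holdout_r = simulate(seed=1)
print(round(same_sample_r, 2), round(holdout_r, 2))
```

Since the key capitalizes on chance differences within the analysis sample, the first coefficient should come out large while the second should hover near zero, which is the pattern the empirical comparison illustrates.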

The aforementioned Pearson correlation co-
efficient of .76 may be somewhat questionable
on the basis that it was computed on the 100
subjects in the two tails of the distribution,
each of which included approximately 28 per
cent of the sample. To investigate this possi-
ble source of error, a biserial r for widespread
classes was computed according to the method
developed by Peters and Van Voorhis (13, p.
385). The biserial r validity coefficient was
found to be .70, which again emphasizes the
amount of error introduced when cross-valida-
tion procedures are not employed. It is inter-
esting to note that the regular product moment
r and the biserial r are in close agreement, both
being computed from the same data. The bi-
serial r for widespread classes is probably the
better estimate of the validity coefficient in
this particular situation and will be referred to
in the following comparison with the cross-
validation coefficient.

2 In this connection, mention should be made of an
investigation performed by Kurtz (10), which substan-
tiates this finding. Kurtz conducted a validity study
of the Rorschach Test as a predictor of managerial
success. Managers of life insurance agencies were rated
as satisfactory or unsatisfactory by their supervisors,
and using this outside criterion, a scoring key was
developed. When this scoring system was applied to
the same sample, 79 out of the 80 managers were cor-
rectly classified by the test. However, upon cross-
validation on another group of 41 managers, the scoring
key yielded scores which “showed no relation whatever
to managerial success.”

3 In the 1948 manual for the Classification Inventory,
Jurgensen (6) subsequently has recommended that the
validity of the test be determined on a hold-out group.

The validity coefficients obtained in this 
study are as follows: 


Unweighted key (N = 82)............. .12
Weighted key (N = 82)............... .16
Weighted key (N = 100).............. .70⁴


Although the increase in the validity co- 
efficient from .16 to .70 is large enough to be 
convincing, a test of the significance of the 
difference between these two r’s was made. 
Using Fisher’s z transformation, a critical 
ratio of 4.66 was obtained, leaving little doubt 
concerning the existence of a real difference. 
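
The reported critical ratio can be reproduced with the standard test for the difference between two independent correlations. The sketch below assumes the usual standard error for a difference of Fisher z values, the square root of 1/(N1 - 3) + 1/(N2 - 3); the function name is illustrative.

```python
from math import atanh, sqrt


def fisher_z_critical_ratio(r1, n1, r2, n2):
    """Critical ratio for the difference between two independent r's."""
    z1, z2 = atanh(r1), atanh(r2)  # Fisher's z transformation
    se_diff = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z2 - z1) / se_diff


# Hold-out validity (.16, N = 82) against the item analysis value (.70, N = 100).
cr = fisher_z_critical_ratio(0.16, 82, 0.70, 100)
print(round(cr, 2))  # 4.66
```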

Admittedly, it is often difficult to secure
samples of sufficient size in the practical situ-
ation, particularly in the case of studies re-
quiring both item analysis and validation. It
is recognized that the size of the groups used
in the present study is not all that might be
desired. Katzell (7) offers a possible solution
to the dilemma confronting the psychologist
who has available a sample of limited size upon
which to determine the validity of a test that
requires a preliminary item analysis. This
suggested procedure involves the selection of
two relatively small random samples, deriving
two separate scoring keys, and performing a
double cross-validation. Items for which the
composite validities are significant are re-
tained for the final key. This technique seems
to be promising, for it apparently would ex-
tract the maximum information from a given
set of data. Of course, it would be desirable
to have a formula which would allow one to
validate on the same sample used for the item
analysis, and then step down the obtained
validity coefficient to an estimate of its size
had it been obtained on a hold-out group. A
formula for this purpose unquestionably would
be of great value to the psychologist, who often
is unable to secure samples of adequate size.
While such a formula may not be impossible
to derive, it has not been derived so far. Until
such a formula is developed, the recommenda-
tion to follow cross-validation procedures can-
not be made too emphatic. It is not sufficient
merely to comment that results obtained with-
out cross-validation should be interpreted with
caution; there is no legitimate interpretation
of such correlations.5

4 Pseudo-validity coefficient determined without cross-
validation and estimated by the Peters and Van Voorhis
biserial r for widespread classes.


Summary 


This study was designed to investigate the 
validity of the Jurgensen Classification In- 
ventory, a personality test designed for per- 
sonnel selection, as a predictor of academic 
achievement of college students. The test 
was administered to 261 male students. Using 
grade point average as the criterion of scholas- 
tic success, an item analysis was completed on 
179 subjects selected at random from the total 
group. Two scoring keys were derived, and 
their validities determined on the hold-out 
group of 82 students by computing Pearson 
correlation coefficients between test scores and 
the criterion. For both scoring keys developed, 
validity coefficients were positive but lacked 
statistical significance. The higher correlation 
coefficient was found to be .16. As a result of 
this investigation, it may be concluded that 
for this population, the Classification Inventory 
has not been shown to be sufficiently valid to 
warrant its use as a selection device. The 
fallacy of failing to use a hold-out group for 
test validation was illustrated by computing a 
validity coefficient on the same subjects used in 
the item analysis group, with the result that
the correlation coefficient jumped from .16 to
.70.


Received December 18, 1950.

5 For a humorous account of the issue of cross-
validation, see Cureton, E. E. Validity, reliability and
baloney. Educ. Psychol. Measmt., 1950, 10, 94-96.


Part-Time Employment for the Older Worker *


Jeannette E. Stanton 
The Ohio State University 


The large and increasing numbers of older 
people in this country are now much mentioned 
(5). It is predicted that by 1980 more than 
50 per cent of the population will be over 45 
while those over 65 will have increased from 
11½ million in 1950 to over 20 million (1, 2, 3).
This increase of older persons brings the prob- 
lem of maintaining them in a productive capac- 
ity. “Even in 1948, a period of ‘minimum’ 
unemployment generally, unemployment rates 
for wage and salary workers aged 45 or over 
were significantly higher than for younger 
adults. . . . With the rise of unemployment 
after 1948, older workers were especially hard- 
hit. Between the first quarter of 1948 and the 
corresponding period of 1950, the unemploy- 
ment rate for all wage and salary workers in- 
creased by slightly less than 80%, while the 
rate for workers aged 45-64 more than doubled” 
(6). Moreover, as a result partly of retirement
plans at 65 (or 60 for women in some businesses) 
and of attitudes against the older worker, those
past these ages are less often employed than
formerly; in 1890, 74% of men 65 and over
were in the labor force but in 1950 this had 
dropped to 45 per cent (4). 

The mounting defense effort is now causing 
an increasing labor shortage. Many businesses 
are already searching for additional workers; 
and the older age groups are being recognized 
as important possible sources. The present
thus seems a strategic time for exploring possi-
bilities as to employment of older people. This
paper reports an inquiry in these directions. 


Materials of the Study 


Through the cooperation of the employment
office of a large midwestern department store,
the writer had access to the personnel records
of some 3,000 “extra” employees or persons not
on the regular payroll but who come to work
when called. The findings for this group
would seem particularly relevant to the prob-
lem of the older person, perhaps over 65, who
seeks less than full-time employment. The
extra employees were of three types: sales,
non-sales, and terminal. As would be ex-
pected, sales employees are those people whose
major responsibility is the selling of merchan-
dise to the customer. The non-sales group has
a wide variety of jobs ranging from cashiers to
watchmen and janitors. In a large depart-
ment store the proportion of selling to non-
selling employees is about 40 per cent selling to
60 per cent non-selling. In the non-sales
category the males are in the majority, whereas
in the sales the females predominate. Term-
inal employees in this study are those extra
employees, both sales and non-sales, who had
terminated their availability with the company
during 1950. At the time of this study the
company had 3,027 regular full-time employees,
1,898 extra employees, and 1,163 former extras
who had dropped their connections with the
company during 1950. The question was as
to the ages of these extra workers, and as to
comparative satisfactoriness of the older ones.

* This investigation was supported by the Ohio State
University Development Fund and was under the direc-
tion of Dr. S. L. Pressey.


Results 


As indicated above, the major purpose of this 
inquiry was to investigate the extent to which 
older persons might well find employment in a 
department store as extra or occasional work- 
ers. The first question is the extent to which 
extra work was done by persons in the various 
age groups. Table 1 yields information first 
of all regarding the age when hired of these 
extra workers. In view of the tendency of 
many employers not to take on new workers 
over 45 years of age, any willingness to employ 
persons over that age seems of special interest. 
Table 1 shows, as might be expected, that the 
greatest number of these extra workers were 
hired when in the 16-30 age group. However,
an appreciable number were employed for the
first time when over 45, 126 or 17 per cent of
the sales group, 153 or 13 per cent of current
non-sales employees, and 93 or 8 per cent of
former extra workers. In fact, a total of 44
were employed for the first time when over 60,
and 7 were hired by this company for the first
time when they were over 65! Here surely is
evidence of a willingness to take on persons so
old that many firms would at once have turned
them away.


Table 1


Number and Percentage of Extra Workers of Each Sex in a Department Store
Who When Hired were in Various Age Groups


Number and Percentage in Each Age Group

Age When   Current Sales        Current Non-sales    Former Extras*
Hired      Male      Female     Male      Female     Male      Female
           No.   %   No.   %    No.   %   No.   %    No.   %   No.   %
14-30       94  63   300  50    464  73   258  50    382  78   460  69
31-45       34  23   192  32    104  16   173  34     71  14   157  23
46-60       17  11    98  17     56   9    79  15     31   6    47   7
61 up        4   3     7   1     11   2     7   1     10   2     5   1
Total      149 100   597 100    635 100   517 100    494 100   669 100


* Former extra workers had at the time of this study terminated their relations with the company, while
current extras were still available. Included in the study were only those former extra workers (either sales or
non-sales) who had been on the rolls at some time during 1950 but dropped out during that year.
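
The percentages quoted for workers first hired when over 45 can be rechecked from the counts given in the text and the column totals of Table 1. The short sketch below does only that arithmetic; the variable and function names are illustrative.

```python
# (number first hired when over 45, total in group); totals are the
# male + female column totals of Table 1.
groups = {
    "current sales": (126, 149 + 597),
    "current non-sales": (153, 635 + 517),
    "former extras": (93, 494 + 669),
}


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


shares = {name: percent(k, n) for name, (k, n) in groups.items()}
total_over_45 = sum(k for k, _ in groups.values())
total_cases = sum(n for _, n in groups.values())

print(shares)
print(total_over_45, total_cases, percent(total_over_45, total_cases))
```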

How satisfactory did the older workers turn 
out to be? Table 2 shows a first appraisal of all
these extra workers in terms of the length of 
time they were available for employment. It 
should be explained in this connection that it 
costs the store about fifty dollars to train a new 
employee and that a frequent turnover is ob- 
viously inefficient. In general, the longer an 
extra worker is available on call, the more satis- 
factory he is as an extra worker; he can be ob- 
tained when needed, and is prepared from 
training and experience at once to step into 
whatever he is asked to do when called. The 
section marked “Average Days Available” 
gives this information in terms of work days, 
assuming 26 work days to the month. The 
findings here are strikingly consistent, for both 
sexes and all three classes of workers. In all 
instances, those in the youngest age group are 
available for the shortest average time, and 
there is a regular progression to the oldest 
group who are available for the longest time. 
Thus the 7 current saleswomen over 60 years of
age when hired, had been available an average
of 412 days or well over a year of 312 work
days; and since they were current workers still
available at the time this study was made, it
may be assumed that their total service to the
company will last yet longer.1

This longer availability of the older person 
is understandable. Younger workers either 
move to a full-time job or for other reasons do 
not continue in occasional or extra work. On 
the other hand, older people may be glad to 
have work which is only occasional. A mar- 
ried woman may wish to supplement the 
family income, though she does not feel that 
she can take a full-time job. An older person 
may not have the vigor for full-time work, but 
find episodes of extra work possible. A person 
on inadequate pension might thus earn a sup- 
plemental wage. But whatever the factors, 
the longer availability of the older extra work- 
ers seemed clearly a characteristic in their 
favor. 

Though longer availability appeared to be a
distinctive merit of older people for extra work,
a more positive evidence of their satisfactori-
ness in that work was desired. Fortunately
the personnel records included notations as to
wage increases. All records were therefore
gone over to determine the number of persons
obtaining any increase, and number of in-
creases each obtained. The company em-
phasized that wage increases for these extra
workers were based on merit, not seniority or
age. Of the total 3,061 cases, only 13 per cent
were given raises in wage. That only 7 per
cent of the former extra workers obtained
raises as compared with 17 per cent of the other
two groups is further evidence of the previ-
ously mentioned probably lesser competence of
some of the group who had dropped out. How-
ever, the important finding can be summarized
in two pairs of percentages. Of all the former
or current sales or non-sales extra workers who
were hired when 45 or under, only 11 per cent
got any wage increase, as contrasted with 28
per cent of those hired when over 45. Of the
first group, 2 per cent were given two or more
increases as compared with 10 per cent of the
older group. Table 3 gives the detail.

1 At first thought it seems inconsistent that the former
extra workers, who had completed their time with the
company, had the shortest average times available.
But there was evidence to indicate that a good many
of these former workers had separated from the com-
pany because unsatisfactory, and so after a shorter time.

Particularly noticeable is the consistent rise 
from the youngest to the oldest ages, for all 
three types of women workers. Those hired 
when under 31 least often received wage raises, 
and the percentages so favored grow in each 
succeeding age bracket until those hired when 
over 60 show over half thus recognized as 
good. 

Table 2 


Average Days Available* of Extra Workers of Each Sex, 
in Each Age Group As to Time of Hiring 


Average Days Available 


Current Current Former 
Age Sales Non-sales Extra Workers 
When 
Hired M F M F M F 
14-30 105 103 119 100 60 61 
31-45 109 196 248 69 94 
46-60 264 305 463 217 103 84 
61 up 331 412 476 312 192 276 


* Number of work days, assuming 26 work days a 
month and 312 work days a year. For current em- 
ployees, days available was number of work days from 
the date of hiring to January 13, 1951 when the writer 
“closed the books”; for former extra workers, days 
available was number of work days from date of hiring 
to date of severing connections with the company. 
Thus to determine days available, for the total 3,061 
cases, was a somewhat laborious task; but the informa-
tion was of major importance for appraising extra
workers.


Table 3


Percentages* of Extra Workers Hired at Various Ages, 
Who Received One or More Wage Increases 


Percentage Receiving One or More 
Wage Increase 


Current Current Former 
Age Sales Non-sales Extra Workers 
When 
Hired M F M F M F 
14-30 12 11 8 16 5 6 
31-45 3 21 21 25 10 11
46-60 33 6D 13 13


* The reader should note from Table 1 that the per- 
centages are based on small numbers of cases hired in 
the older ages; thus only 17 current sales men and 98 
women were first hired when between 46 and 60, and 
only 4 men and 7 women when over 60. But Table 2 
also has relevance here. At the time of this study all 
the cases were, of course, older than their ages when 
hired; and those hired when older were proportionally 
more old, because they had been with the company 
longer. If the data in the above table had been grouped 
according to age at time of study, the number of older 
workers, and number receiving wage increases, would 
have been even larger. 


In view of the relatively small number of 
those hired for the first time by this company 
when they were over 45 (372 or 12 per cent of 
the total, with 44 or a bit over 1 per cent hired 
when over 60), inferences must be made with 
caution. In general, persons hired when over 
45 are probably selected with extra care. At 
least such findings in favor of these older work- 
ers (their longer availability, and more frequent 
earning of wage increases) seem to justify the 
policy of the company in thus taking them on. 
But it is hard to believe that the selection was 
so perfect that all good older workers were 
found. Surely the results would warrant the 
conclusion that there are probably many older 
persons who could serve satisfactorily in work
of the general type investigated.2


Implications 


It seems inevitable that with the growing
defense effort, full-time employment will be
available perhaps as never before, and finding
people for extra work will become increasingly
difficult. The evidence presented above suggests
that older people may well be used to
meet this emergency.

2 In an effort further to study these extra workers,
detailed tabulations were made as to previous employment,
as they had reported in their personnel form.
These reports were found inadequate, especially for the
women (thus housework might or might not be listed
as previous work). But in general the older workers
did not show such extensive previous unemployment
as would suggest incompetence.

If there is to be more use of persons past 65, 
extra work would seem indicated. As sug- 
gested in the beginning, it seems highly desir- 
able that these older people should continue in 
some economic activity, and desirable for their 
mental health that they do so. There might 
well be a survey as to types of work for which 
such people might seem especially suited. As 
mentioned earlier, the company here studied 
was found to have hired for the first time 7 
persons over 65, and other extra workers, 
hired when slightly under this age, have continued
past 65. And it may well be added that
the company also had some 70 regular full-time
employees over the 65 mark.

A search for opportunities for older women 
seems especially needed. Not only do women 
live longer than men; this fact, plus the tend- 
ency of women to marry men older than they, 
brings it about that two-thirds of all women 65 
and over are either widowed, single, or di- 
vorced and may need some work. Types of 
work that older women can do, and into which 
they might go even after many years as house- 
wives, are needed. The excellent record of the 
women in this study seems suggestive in this 
connection. 

Summary 


The large and increasing numbers and pro- 
portions of older people in this country, and 
the growing tendency to retire workers at an 
arbitrary age, before the usefulness of many 
of them has ended and while they still desire 
and need to work, combine to make important 
a search for types of employment suitable for 
older persons and investigation as to their 
satisfactoriness in such work. And the mount- 
ing labor shortage makes insistent the need for 
finding all possible sources of workers; people 
in the older ages presumably might well be a 
major source in this situation. This paper 
reports certain data bearing on these problems. 


1. The study had to do with some 3,000 extra 
workers in a large department store. Some of 
these were sales people, some in non-sales work, 
and some had been in one or the other type of 
work but had severed their connection with the
company during 1950.

2. The largest single group of these extra 
workers were under 31 when hired. However, 
a considerable number were first hired by this 
company when over 45, an appreciable number 
over 60; and 7 were over 65 when employed for 
the first time by this concern. Such willing- 
ness to take on new workers at these ages was 
considered highly desirable, and made espe- 
cially interesting the evaluation of these older 
workers in comparison with the younger. 

3. Since it costs about fifty dollars to train 
a new worker and rapid turnover is inefficient, 
the length of time that an extra worker is avail- 
able is an important criterion of his satisfactor- 
iness. It was found that, in general, the older 
the worker the longer he was available; thus 
women extra sales people 30 years old or 
younger were available an average of 103 work
days as compared with a 412 day average for 
those over 60. 

4. Wage increases (which were made on 
merit, not seniority) were obtained by older 
extra workers more often than by younger 
extra workers. Thus 11 per cent of the extra 
sales women under 31 obtained an increase as 
compared with 57 per cent of those over 60. 

5. It is concluded that the mounting defense 
effort will increasingly make younger extra 
workers hard to get but that older workers may 
fill many of these positions better, and that the 
situation may increasingly bring employment 
of persons even past 65 for such occasional 
work.

This survey of opinions of the deportment of
salespersons was brief both as to questions 
asked and numbers consulted. There was no 
thought of making an exhaustive study or of 
securing anything more than an indication of 
what male college students feel are “favorable” 
and “unfavorable” qualities of salespersons. 
The report is presented with all of its limita- 
tions recognized in the hope that others will 
conduct better controlled surveys to provide 
adequate knowledge in this neglected area of 
applied psychology. 

The group surveyed included 100 men stu- 
dents of The University of Texas of at least 
sophomore rank. Forty-five were engineering 
students; twenty-five were Austin Presbyterian 
Seminary students taking special courses in 
The University; and thirty were business ad- 
ministration majors. 

Only two questions were asked and each in- 
dividual made his response in writing after 
having had one day to think about it. The 
two questions were: 


1. What one thing impresses you most favor- 
ably about a salesperson’s speech and deport- 
ment? 

2. What one thing impresses you most un- 
favorably about a salesperson’s speech and de- 
portment? 

The responses were classified under ten 
broad headings and then listed, as shown in this
report, in decreasing order of importance. The 
percentage of choice for each is given. When 
an individual responded to either question with 
more than one idea, the first listed was con- 
sidered the most important and chosen for the 
report. 

The findings certainly are neither surprising
nor unexpected. It is probable that a similar
survey of several thousand men of varying ages,
professions, and occupations would produce
similar results, although a more representative
sampling should be obtained to ascertain the
facts.

* In his capacity as a university speech teacher, the
author has been invited annually for several years to
speak to the employees of the E. M. Scarbrough Department
Store, Austin, Texas, on some phase of speech
relative to selling. The present report is an attempt
to gather opinions from male customers to pass on to
the employees.

Significant is the fact that 94 per cent of the
responses related to a mental or personality
quality; only 4 per cent referred to speech;
only 2 per cent to dress (neatness); and none
mentioned physical features. Thorough
knowledge of merchandise is regarded as a
favorable quality in 10 per cent of the responses
and lack of knowledge of merchandise ranks
third (again 10 per cent) in the list of unfavorable
qualities. This argues for the use of
merchandise knowledge tests in selection and
training of salespersons.


Table 1

Rating of Favorable Qualities in Salespersons

                                              Percentage
Favorable Qualities                           Choosing
 1. Friendly attitude                             35
 2. Desire to be of help                          15
 3. Courteous manner                              15
 4. Interest in customer and work                 10
 5. Thorough knowledge of merchandise             10
 6. Patience                                       4
 7. Distinct, correct speech                       4
 8. Promptness                                     3
 9. Neat appearance                                2
10. Sincerity                                      2


Table 2

Rating of Unfavorable Qualities in Salespersons

                                              Percentage
Unfavorable Qualities                         Choosing
 1. High-pressure salesmanship                    50
 2. Overly helpful                                18
 3. Lack of knowledge of merchandise              10
 4. Lack of interest in customer or material       9
 5. Overly friendly                                5
 6. Ignoring customer’s opinion                    3
 7. Slurred speech                                 2
 8. Watchdog attitude                              1
 9. Ignoring customer’s presence                   1
10. General indifference                           1
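
The 94, 4, and 2 per cent figures quoted for the favorable responses can be checked by grouping the Table 1 percentages. The sketch below is illustrative only; the assignment of each quality to a group is an assumption inferred from the text, not a classification given by the author, and the name `totals_by_group` is hypothetical.

```python
# Percentages from Table 1 (favorable qualities).
# The group labels are assumed for this illustration.
favorable = {
    "Friendly attitude": (35, "mental or personality"),
    "Desire to be of help": (15, "mental or personality"),
    "Courteous manner": (15, "mental or personality"),
    "Interest in customer and work": (10, "mental or personality"),
    "Thorough knowledge of merchandise": (10, "mental or personality"),
    "Patience": (4, "mental or personality"),
    "Distinct, correct speech": (4, "speech"),
    "Promptness": (3, "mental or personality"),
    "Neat appearance": (2, "dress"),
    "Sincerity": (2, "mental or personality"),
}


def totals_by_group(table: dict) -> dict:
    """Sum the percentage choosing each quality within its group."""
    totals: dict = {}
    for percentage, group in table.values():
        totals[group] = totals.get(group, 0) + percentage
    return totals


group_totals = totals_by_group(favorable)
print(group_totals)
print("all responses:", sum(group_totals.values()))
```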

It would be interesting to know to what de- 
gree women customers differ on these points, 
especially on dress and physical features. 


Typical Comments 


The following quotations are typical of the 
original responses. 


“I do not like to be high-pressured or rushed
into a choice.” 

“The thing that I notice most about a good 
salesman is that he talks about other things 
than trying to force a sale upon the customer. 
While the customer is mainly interested in 
buying, he also enjoys talking to the salesman 
about football, for instance. The salesman 
should not make the customer feel as if he is 
trying to rush him to a decision.” 

“Good usage and correct English is usually 
employed, and it is desired.” 

“The fawning salesman who is constantly 
dogging your heels, looks at you like he thinks 
you might run off with the stock, is a pain in 
the neck.” 

“I notice first of all whether or not the clerk
is courteous.” 

“Neatness is the one thing I think should be 
maintained at all times.” 

“Shopping is simplified and more enjoyable 
when the various departments are easily found. 
That is, all clerks and salespeople should have 
a thorough knowledge of all departments or at 
least know their location in the store.”

“I dislike the salesperson who pays no attention
to my ideas or how I want something
to fit or look. Granted that he knows more 
about the merchandise than I do, it still 
doesn’t give him the right to ignore my wishes 
completely.” 

“I have noticed that many clerks these days
don’t seem to be sincere in wanting to satisfy 
the needs of a customer. I suppose that is a 
bit of a hangover from the last war when stores 
didn’t really need salesmen to sell things; that 
is, that the demand was so great and the supply 
was so small that anything and everything 
would sell itself. This situation has almost 
entirely corrected itself, but there are a few 
clerks that don’t seem too interested as yet.” 
“My ‘pet peeve’ concerning department
store service is an attitude by clerks which ap- 
parently indicates a lack of knowledge concern- 
ing the goods for sale in their departments. 
Oftentimes a clerk will show complete ignorance 
of the quality, function, etc., of items in their 
charge.” 

“Courtesy is one of the most essential factors 
in selling merchandise and breaking down sales 
resistance. The more courteous a salesperson 
is, the harder it is for the customer to refuse to 
buy.” 

“The sales people I have encountered at 
various stores have often been too helpful, and 
they leave me with the impression that they be- 
lieve each person who enters the door knows 
exactly what he desires. Actually, a lot of 
people are suggested of their desires by inter- 
esting displays in the store. A good clerk can 
catch a prospective buyer’s interest in a specific
item and at the proper time approach him
to further interest the buyer and to complete 
the sale. Clerks should realize people do like 
to ‘look around’!” 

“I like a salesperson with a pleasant voice,
a good sense of humor, and an interest in my 
problem.” 

“I like a salesperson who acts as though he
were a friend who had come shopping with me 
to advise rather than as a person employed to 
sell me an article.” 

“Things I look for in a salesperson are: his 
or her friendly manner, a feeling of being 
welcome to that particular department, a feel- 
ing that the salesperson is interested in my 
particular need, and that the salesperson is like 
a friend you go to when you need something. 
I don’t want to be high pressured into buying 
anything; if I buy an article I want to feel that 
it is what I wanted and needed.” 


Summary 


Two conclusions from the results presented 
here would be: first, the majority of men cus- 
tomers like most of all in a salesperson the 
qualities: friendliness, courteous manner, and 
helpfulness; and, secondly, the majority of men 
customers dislike most of all high pressure 
salesmanship and the “gushy,” over-helpful 
salesperson. 


Received January 26, 1951. 
Much has been said and written about the
possibility of “fatigue” while an interviewer 
and respondent leaf through a magazine to 
determine which advertisements have been 
seen and read (1, 2, 3, 3a, 5). The existence of 
interviewing fatigue has been established. 
But, it is also reasonable to assume that there 
is another sort of fatigue: that which takes 
place when a person is reading a magazine. To 
take an extreme example: it is highly improb- 
able that a person will give the same amount of 
attention to each of the advertisements in the 
issue of a magazine with 1,000 advertisements 
as he will in an issue with only one advertise- 
ment. There must be reading fatigue as well 
as interview fatigue. The purpose of this 
analysis was to measure the effect of each type 
of fatigue, determine its relative importance, 
and to investigate the possible determinants of
each. 


Procedure 


The combined effect of both types of fatigue 
is obtained when the average reading of “thick”
issues is compared with that of “thin” issues. A
thick issue has many advertisements—a thin 
issue has only a few. The reader is confronted 
with more advertisements when originally 
reading the thick magazine, and also when he is 
interviewed. Therefore, in order to evaluate 
the effect of one type of fatigue, it is necessary 
to hold the other type constant. Since the 
length of the interviewing period was more
easily manipulated, this was allowed to vary. 
The length of the reading period was then held 
constant. 

This was accomplished by “splitting” the
interviewing period in half: each respondent
was questioned on only one half of the advertisements
in the magazine. In this case, he
had covered the entire magazine in his original
reading, but was interviewed on just half of it
(split). These results were then compared
with those for persons who were interviewed on
the entire magazine (non-split). The differences between
these two situations would then be due to interviewing
fatigue, since reading fatigue was kept
constant.

Control of Variables 


Identical advertisers were used in the two
comparisons. Size and color of the advertise- 
ments were kept constant. Position of the 
advertisements in each magazine was con- 
trolled, since there was a tendency for an ad- 
vertiser in thin issues to receive better position 
and to be on a right hand page. Insofar as 
possible, seasonal differences in reading inter- 
est were controlled by selecting advertisers of 
products of uniform interest throughout the 
year. Two women’s monthly magazines were 
used to insure that the results would not be 
unduly influenced by special situations pre- 
vailing within one magazine. The average 
number of advertisements in the issues in 
question was kept constant, except when vari- 
ation was called for by the experimental design. 


Experimental Design 


The average number of advertisements, half 
page or larger, was 165 in thick issues and 74 
in thin issues of Magazine A. The average 
number of advertisements in the selected issues 
of Magazine B was 198 in thick issues and 91 
in thin issues. Since the respondent was inter- 
viewed over all the advertisements in the maga- 
zine in both thick and thin issues, any difference 
in readership between the two sizes of issues 
would be due to the effect of original reading 
plus interview fatigue. 

This was not the case for split and non-split 
issues. The average number of advertisements 
appearing in Magazine A was 153 for the split 
issues and 166 for the non-split issues. The 
average number of advertisements in Magazine 
A that were covered during the interview was,
respectively, 76.5 and 166. For Magazine B 
there were 175 advertisements in the split issues 
and 200 in the non-split issues, but 87.5 and 
200, respectively, were covered during the 
interview. In the thick and thin issues the
number of advertisements covered during the
original reading and the interview were equal.
While this situation held true for the non-split
issues, it did not for the split issues. This
difference in number of advertisements represented
a difference in interview fatigue which
was measured by readership scores.
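
The contrast between the split and non-split conditions can be stated as the fraction of the originally read advertisements that the interview covered. The following is a minimal sketch using the averages quoted above; the function name `interview_coverage` and the dictionary layout are assumptions made for illustration.

```python
# (average advertisements in the issue, average covered in the interview),
# as reported in the text for each magazine and condition.
issues = {
    ("A", "split"): (153, 76.5),
    ("A", "non-split"): (166, 166),
    ("B", "split"): (175, 87.5),
    ("B", "non-split"): (200, 200),
}


def interview_coverage(ads_read: float, ads_covered: float) -> float:
    """Fraction of the advertisements in the issue that the interview covered."""
    return ads_covered / ads_read


coverage = {key: interview_coverage(*pair) for key, pair in issues.items()}
for (magazine, condition), fraction in coverage.items():
    print(f"Magazine {magazine}, {condition}: {fraction:.2f} of the issue covered")
```

Reading load is roughly the same within each magazine, while the interview load in the split issues is half that of the non-split issues, which is the difference the design relies on.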

The non-split and split issues appeared in 
the same months but different years. It has 
been shown previously (1) that when condi- 
tions were constant, the level of reading in 
1948 was equal to that in 1949. Notice that 
the number of advertisements covered during 
the interview for the split issues was half the 
number covered during the original reading. 
Also, note that in the split and thin issues the 
average number of advertisements that were 
covered during the interview was about equal, 
as well as the other averages within each 
magazine. 

There were six split issues and six non-split 
issues for both magazines. The split issues 
appeared in 1949. The non-split issues ap- 
peared in the same months of 1948. The 
one page, four color averages for the split and 
non-split issues of Magazine A were based on 
70 advertisements and 10 advertisers. The 
averages for Magazine B were based on 38 
advertisements and 10 advertisers. The one 
page, four color averages for the thick and 
thin issues were based on 103 advertisements 
and 10 advertisers for Magazine A and 69 
advertisements and 10 advertisers for Maga- 
zine B. The thick and thin issues appeared 
from 1945 to 1949. For Magazine A, 11 thick 
and 11 thin issues were used. For Magazine 
B, 9 thin and 10 thick issues were used. 

The one page, black and white averages for 
Magazine A were based on 112 advertisements 
and 8 advertisers for the thick and thin issues, 
and 58 advertisements and 8 advertisers for 
the split and non-split issues. Similar averages
for Magazine B were based on 66 advertisements
and 10 advertisers for thick and thin
issues, and 44 advertisements and 10 advertisers
for split and non-split issues.


Results 


Three measures of readership were used: 
Noted, Seen-Associated and Read Most. They 
are defined as follows: 


Noted. Per cent of readers of the magazine 
who remembered having seen the advertise- 
ment. It is cumulative and includes Seen- 
Associated readers. 

Seen-Associated. Per cent of readers of the 
magazine who remembered not only having 
seen the advertisement, but also had associ- 
ated it with the product or advertiser. It is 
cumulative and includes Read Most readers. 

Read Most. Per cent of readers of the 
magazine who had read fifty per cent or more 
of the words in the advertisement. 

For each issue 200 women were interviewed 
concerning the advertisements they had seen 
and read. 


One Page, Four Color Advertisements


Table 1 gives the results expressed in terms 
of the per cent of increase in readership in the 
thin and split issues. For Magazine A there 
was 15 per cent higher Noted readership in the 
thin issues than in the thick issues. There was 
also 15 per cent higher Noted readership when 
reading fatigue was held constant. There- 
fore, all of this increase in readership can be 
attributed directly to interview fatigue. 


Table 1


Relative Importance of Reading and Interview Fatigue (One page, four color advertisements)


                  Reading and Interview   Interview Fatigue       Per Cent Due to                 Per Cent Due to
                  Fatigue                 Alone                   Interview Fatigue               Reading Fatigue
                  100 X (Thin/Thick)      100 X (Split/
                  - 100                   Non-Split) - 100
                  Magazine  Magazine      Magazine  Magazine      Magazine  Magazine  Both        Magazine  Magazine  Both
                  A         B             A         B             A         B         Magazines   A         B         Magazines
Noted             15%       11%           15%*      11%*          100%      100%      100.0%      0%        0%        0.0%
Seen-Associated   17        10            12*       10*           71        100       85.5        29        0         14.5
Read Most         28        34            4**       5**           14        15        14.5        86        85        85.5


* Not significantly different from combined reading and interview fatigue.
** Significantly different from combined reading and interview fatigue at 1% level.
The differences between reading plus interview
fatigue, and interview fatigue alone for
Noted and Seen-Associated were not significant 
for either magazine. In other words, inter- 
view fatigue alone accounted for all the differ- 
ences between these two experimental condi- 
tions. 

For Read Most, interview fatigue failed to 
account for all of the difference. The increase 
due to interview fatigue alone (4 per cent and 
5 per cent) was significantly different from the 
increase due to both types of fatigue. There- 
fore, reading fatigue was a major factor. 

The conclusion for one page, four color ad- 
vertisements is that interview fatigue mainly 
affects Noted and Seen-Associated readership, 
while reading fatigue mainly affects Read Most. 
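
The shares attributed to each kind of fatigue follow from dividing the split-over-non-split increase by the thin-over-thick increase. The sketch below reproduces that arithmetic for the four color figures; the function names are hypothetical, and the Seen-Associated split figure for Magazine A is taken as 12, a value inferred from the reported 71 per cent share rather than read directly from the table.

```python
def pct_increase(low_fatigue_score: float, high_fatigue_score: float) -> float:
    """Per cent increase, 100 x (low / high) - 100, as in the table headings."""
    return 100.0 * low_fatigue_score / high_fatigue_score - 100.0


def fatigue_shares(combined_increase: float, interview_increase: float) -> tuple[float, float]:
    """Split a thin-over-thick increase into interview and reading shares (per cent)."""
    interview_share = 100.0 * min(interview_increase / combined_increase, 1.0)
    return interview_share, 100.0 - interview_share


# (thin-over-thick increase, split-over-non-split increase), in per cent.
four_color = {
    ("Noted", "A"): (15, 15),
    ("Noted", "B"): (11, 11),
    ("Seen-Associated", "A"): (17, 12),  # 12 is inferred, see the note above
    ("Seen-Associated", "B"): (10, 10),
    ("Read Most", "A"): (28, 4),
    ("Read Most", "B"): (34, 5),
}

shares = {key: fatigue_shares(*pair) for key, pair in four_color.items()}
for (measure, magazine), (interview, reading) in shares.items():
    print(f"{measure:16s} Magazine {magazine}: interview {interview:.0f}%, reading {reading:.0f}%")
```

`pct_increase` is included only to show the form of the two increase columns; the readership scores that would feed it are not reported in this section.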


One Page, Black and White Advertisements 


The increases in the readership of thin and 
split issues are given in Table 2. For Maga- 
zine A there was an increase of 19 per cent in 
Noted readership of the thin issues. This 
compares with an increase of 9 per cent in the 
split issues. This difference is significant at 
the 5 per cent level. The inference is that 
interview fatigue alone cannot often result in
differences as large as were found. Therefore,
it is apparent that both types of fatigue may
be operating in the readership of thick and
thin issues. The amount that was found to be 
due to interview fatigue was 47 per cent as 
compared with 53 per cent for reading fatigue. 


The same general pattern existed for the other 
magazine and for Seen-Associated readership. 

The differences for Read Most were greater 
and significant at the 1 per cent level. This 
indicates that in very few cases could inter- 
view fatigue alone account for the obtained 
differences. Therefore, Read Most scores in 
thin issues were mainly higher because of 
lessened reading fatigue. The amount that 
was found to be due to reading fatigue was 
about 75 per cent of the total difference. 

The conclusion for one page, black and 
white advertisements is that interview and 
reading fatigue each account for about half 
of the increase in Noted and Seen-Associated 
readership in thin magazines. About 75 per 
cent of the increase in Read Most in thin issues 
is accounted for by reading fatigue. The re- 
maining 25 per cent is due to interview fatigue. 


Comparison Between Black and White 
and Four Color Advertisements 


For Noted and Seen-Associated readership of 
four color advertisements, interview fatigue 
accounted for almost all of the increase in thin 
issues. However, it accounted for only half 
of the increase found for one page, black and 
white advertisements. In other words, casual 
readership of black and white advertisements 
is less affected by interview fatigue than that of four
color advertisements.

There was little difference between one page,
black and white and one page, four color advertisements
for Read Most. Each type of advertisement
was about equally affected by interview
fatigue and reading fatigue. The latter accounted
for 75 per cent to 85 per cent of the
increase in thin issues.


Table 2


Relative Importance of Reading and Interview Fatigue (One page, black and white advertisements)


                  Reading and Interview   Interview Fatigue       Per Cent Due to                 Per Cent Due to
                  Fatigue                 Alone                   Interview Fatigue               Reading Fatigue
                  100 X (Thin/Thick)      100 X (Split/
                  - 100                   Non-Split) - 100
                  Magazine  Magazine      Magazine  Magazine      Magazine  Magazine  Both        Magazine  Magazine  Both
                  A         B             A         B             A         B         Magazines   A         B         Magazines
Noted             19%       13%           9%**      8%            47%       62%       54.5%       53%       38%       45.5%
Seen-Associated   21        14            --        8*            48        57        52.5        52        43        47.5
Read Most         31        25            --        --            13        36        24.5        87        64        75.5


* Significantly different from combined reading and interview fatigue at 20% level.
** Significantly different from combined reading and interview fatigue at 5% level.
*** Significantly different from combined reading and interview fatigue at 1% level.


Discussion 


Interviewing fatigue mainly affected Noted 
and Seen-Associated readership. This inter- 
viewing fatigue was probably caused by either 
the interviewer or respondent failing to give 
equal attention to all advertisements during 
the interview. Whenever there is a difference 
between the attention given to advertisements, 
there is a difference in readership. If the 
interviewer and respondent pay less attention 
to the advertisements covered at the end of 
the interview, reported reading of them will be 
decreased. However, if a respondent has 
thoroughly read an advertisement, the chances 
are excellent that he will spontaneously bring 
it to the attention of the interviewer. This 
probably explains why interviewing fatigue 
affected casual observation to a far greater 
extent than Read Most. 

It takes a minimum of .1 to .2 of a second 
to note an advertisement, but the average 
person will read copy at the rate of one word 
every .2 of a second or five words a second. 
Therefore, it takes a much longer time for a 
person to read the copy than to note the ad- 
vertisement. Because of the small fractions 
of time involved in noting an advertisement, 
there should be little reading fatigue. Because 
of the large (relative) amount of time required 
to read the copy, there should be considerable 
reading fatigue for Read Most. 

This explanation also implies that those ad- 
vertisements with little copy will suffer less 
from reading fatigue than advertisements with 
a great deal of copy. 
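
The time argument can be made concrete with the rates quoted above. This is a rough sketch only: the copy lengths of 50 and 300 words are hypothetical, and `read_most_seconds` is an illustrative name, not a measure used in the study.

```python
SECONDS_PER_WORD = 0.2          # "one word every .2 of a second"
NOTING_SECONDS = (0.1, 0.2)     # minimum time to note an advertisement


def read_most_seconds(words_in_copy: int, fraction_read: float = 0.5) -> float:
    """Shortest reading time that qualifies as Read Most (half the words or more)."""
    return words_in_copy * fraction_read * SECONDS_PER_WORD


# Hypothetical copy lengths, chosen only for illustration.
for words in (50, 300):
    seconds = read_most_seconds(words)
    ratio = seconds / NOTING_SECONDS[1]
    print(f"{words} words: {seconds:.0f} s to Read Most, about {ratio:.0f} times the noting time")
```

Under these assumptions the long-copy advertisement asks for six times as much reading time as the short one, while the time needed to note either is the same.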


The above discussion serves to explain the
difference between the role of the fatigue factors
for the three levels of readership. However,
there remains the question of the difference
between the role of interview fatigue for one
page, four color and one page, black and white
advertisements. Interview fatigue was an
exclusive factor in the Noted readership of one
page, four color advertisements but accounted
for only half of the difference in the Noted
readership of one page, black and white advertisements.

The main functional difference between these 
two types of advertisements was their level of 
readership. The average Noted readership of 
the four color advertisements was about 40 per 
cent, as compared with 30 per cent for the black 
and white advertisements. Probably most of 
the 10 per cent who on the average saw the 
color advertisement and not the black and 
white advertisement, did so because of the 
mechanical attention value of the four colors. 
It has been shown previously (4a) that the 
persons who are added as readers when the
size of an advertisement is increased, tend to be
non-users of the advertised product. Size is a
mechanical factor in stimulating attention. 
Color is also a mechanical factor. In other 
words, the additional 10 per cent who saw the 
four color advertisement were very apt to be 
non-users of the advertised product. 

Any difference between the role of interview 
fatigue for these two types of advertisements 
must come from within this 10 per cent who 
saw one and not the other. In other words, it 
is very possible that the additional interview 
fatigue came mostly from within the group of 
readers who were non-users of the advertised 
product. This receives confirmation from the 
first part of the discussion which mentioned 
failure to give equal attention to all advertise- 
ments as a factor in contributing to interview 
fatigue. The respondents who would be most 
apt to give “less” attention to a particular ad- 
vertisement during an interview would be those 
with the least interest in the advertised prod- 
uct. Such persons would most likely be 
non-users. Furthermore, there was little dif- 
ference for Read Most between the two types of 
advertisements on the role of the fatigue 
factors. Those persons who read most of an 
advertisement are the ones most interested in 
the product; so there should be little interview 
fatigue. Since the product interest of those 
who Read Most was “equal,” interview fatigue 
should be equal for both types of advertise- 
ments. This was the actual situation that was 
found. 
Interview Fatigue Among
Users and Non-Users 


A second analysis was made of the Seen- 
Associated readership of men for automobile 
advertisements appearing in a weekly magazine 
between 1947 and 1950. The issues were 
separated into two groups: those with 100 and 
more half page or larger advertisements and 
those with 50 and less advertisements. The 
average number of advertisements was 105.0 
per issue for 16 thick issues, and 44.1 per issue 
for 19 thin issues. Since 150 men respondents 
were interviewed on each issue this was a 
sample of 2,400 and 2,850 interviews, respec- 
tively. 

The readership of each automobile advertise- 
ment among owners and non-owners of the 
advertised brand was computed for the thick 
and thin issues. The thin issues were 19 per 
cent higher than the thick issues for Seen- 
Associated readership among all men. For 
non-owners (non-users) this figure was also 
19 per cent. But for owners, the increase was 
only 9 per cent. This difference was signifi- 
cant at the 5 per cent level of confidence. 

It was shown before that most of the fatigue 
in high readership advertisements was due to 
interview fatigue. The level of Seen-Associ- 
ated readership of the automobile advertise- 
ments was very high, being above 50 per cent, 
so that it can be safely assumed that the
fatigue in these advertisements was mostly 
interview fatigue. 

Therefore, the most interested persons, 
owners or users, had significantly less interview 
fatigue in thick issues than the least interested 
persons, non-owners or non-users. As the 
issues of a magazine become thicker, those 
persons with low interest in the advertised 
product will be more affected by interview 
fatigue than those persons with high interest. 

Other implications are that as the level of 
readership decreases, the interview fatigue de- 
creases. This latter point has to be related to 
other factors such as size of space, size of potential
and actual market, and type of appeal
that was used. These four points all affect
the level of readership and so affect the extent 
of interview fatigue. 

The type of appeal that is used in an adver- 
tisement is important since it determines the 
selectivity of the advertisement in attracting
users or non-users of the advertised brand or 
product (4, 4b). For example, an advertise- 
ment which prominently displays the package 
or merchandise tends to select as “noters” 
those persons with high interest in the product. 
Those advertisements which contain an appeal 
which is irrelevant to the advertised product 
tend to select a disproportionate number of
persons with low interest in the product. 
Therefore, there should be higher interview 
fatigue for advertisements with irrelevant 
appeals, and lower interview fatigue for advertisements 
with relevant and selective appeals. 


An Example of Effects of Fatigue 


In order to assess the practical importance 
of fatigue factors, a hypothetical example is 
worked out here using the fatigue ratios reported 
in Table 1. 

An advertiser has a one page, four color ad- 
vertisement in an issue with 77 advertisements 
(thin). The advertisement receives 35 per 
cent Noted, 30 per cent Seen-Associated, and 10 
per cent Read Most readership. Later, he 
inserts the same advertisement in a thick issue 
(166 advertisements) of the same magazine. 
This time the advertisement receives 31 per 
cent Noted, 26.4 per cent Seen-Associated, and 
7.6 per cent Read Most readership. The sec- 
ond insertion is lower because of reading and 
interview fatigue. The entire drop of four 
percentage points in Noted readership is attributable to 
interview fatigue. The Seen-Associated reader- 
ship has a drop of 3.6 percentage points of 
which 3.1 are attributable to interview fatigue 
and 0.5 to reading fatigue. The Read Most 
has a drop of 2.4 percentage points, of which 
0.3 are due to interview fatigue, and 2.1 are 
due to reading fatigue. 

The “true” size of the audience that Noted 
the second advertisement was 35 per cent, no 
change. The “true” Seen-Associated audience 
was 29.5 per cent, a drop of 0.5 percentage points. 
The “true” Read Most audience was 7.9 per cent, 
a drop of 2.1 percentage points. It must be remembered 
that the scores for the advertisement in the thin 
issue and the adjusted scores for the advertisement 
in the thick issue still contain both reading 
and interview fatigue. The effect of the 
adjustment is to make the scores of the two 
advertisements comparable on a relative scale. 
One last point is that this analysis has not 
established any absolute measurement of read- 
ing and interview fatigue. It has merely shown 
the contribution of each type to the over-all 
fatigue which takes place in thick versus thin 
issues. 
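
The arithmetic of the hypothetical example can be checked with a short sketch. The scores and the interview-fatigue points are copied from the example itself; the names `thin`, `thick`, `interview_points` and `decompose` are illustrative only, and the sketch does not reproduce the Table 1 ratios from which those points were derived.

```python
# Scores (per cent of readers) taken from the hypothetical example in the text.
thin = {"Noted": 35.0, "Seen-Associated": 30.0, "Read Most": 10.0}
thick = {"Noted": 31.0, "Seen-Associated": 26.4, "Read Most": 7.6}

# Percentage points of each drop that the text attributes to interview fatigue.
# These are copied from the example, not recomputed from Table 1.
interview_points = {"Noted": 4.0, "Seen-Associated": 3.1, "Read Most": 0.3}


def decompose(measure):
    """Split the thin-to-thick drop into interview and reading fatigue."""
    drop = thin[measure] - thick[measure]
    interview = interview_points[measure]
    reading = drop - interview
    # The adjusted ("true") score restores only the interview-fatigue loss,
    # so it equals the thin-issue score minus the reading-fatigue loss.
    adjusted = thick[measure] + interview
    parts = {"drop": drop, "interview": interview,
             "reading": reading, "adjusted": adjusted}
    # Round to one decimal to match the precision used in the text.
    return {name: round(value, 1) for name, value in parts.items()}


for measure in thin:
    print(measure, decompose(measure))
```

The adjusted values agree with the “true” audiences quoted above (35, 29.5 and 7.9 per cent), which is all the sketch is meant to confirm.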
Summary 


An experimental design was devised to 
measure the relative importance of original 
reading fatigue and interview fatigue upon 
reading of advertisements in thick and thin 
magazine issues. 

The findings indicate that when the number 
of advertisements in a magazine doubles— 


1. Interview fatigue accounts for 53 per cent 
to 100 per cent of the decrease in casual ob- 
servation scores—Noted and Seen-Associated, 
and 15 per cent to 25 per cent of the decrease 
in thorough reading—Read Most. 

2. Reading fatigue accounts for 0 per cent 
to 46 per cent of the decrease in casual obser- 
vation scores—Noted and Seen-Associated, and 
76 per cent to 86 per cent of the decrease in 
thorough reading—Read Most. 

3. Interview fatigue is relatively less important 
for Noted and Seen-Associated readership 
of one page, black and white advertisements 
than of one page, four color advertisements. 

4. Read Most reading fatigue is practically 
unaffected by color of the advertisement and 
seems to be mainly a function of the number of 
advertisements in the issue. 

5. Interview fatigue is a function of interest 
in the brand and product. Those persons with 
the greatest interest in the advertised brand 
show the least interview fatigue. 


Aesthetic Preference for Isosceles Triangles * 


Thomas R. Austin and Robert B. Sleight 


The study of form preference is not only ap- 
plicable to designing—the design of a product 
often playing a major role in what a customer 
will buy—but also to other related fields such 
as visual displays and consumer research. 

In preference studies involving various geo- 
metric figures, there is often found a certain 
ratio of altitude to base (the so-called golden 
section) which is said to influence preferences 
for the figures. In other words, when subjects 
are asked to choose the figure they prefer, from 
a series of figures, the preference yields the 
golden section ratio, which mathematically is 
approximately 1.62 to 1. 

Witmer (2) experimented with isosceles 
triangles standing on their bases, his whole 
series involving triangles with altitudes shorter 
than bases. His observers ranked the figures 
in order of preference from charts, each chart 
containing only a portion of the total number of 
stimulus figures. Witmer found a choice 
centering around the ratio of 0.41 to 1 (altitude 
to base). 

Thorndike (1) using essentially the same 
method, but presenting triangles with altitudes 
longer than bases, found a preference centering 
around a ratio of 1.60 to 1 (altitude to base). 
This ratio closely approximates the golden 
section measure. 

The findings of these studies have been ques- 
tioned because of the possible influence of an 
artifact, the central tendency of judgment (3). 
The subjects supposedly preferred the triangle 
on each chart which was the central one on that 
chart; the choice determined not by preference 
alone, but by position as well. 


Purpose 


The purpose of this study was to quantify 
the consistency of preferences for simple geometric 
forms (isosceles triangles) in order to 
clarify the findings of previous studies on the 
subject. 

* This research was supported in part under the 
terms of a contract between Special Devices Center, 
Office of Naval Research, and The Johns Hopkins 
University, Contract N5-ori-166, Task Order I. This 
is Report No. 166-I-141, Project Designation No. 
NR784-001, under that contract. 

Psychological Laboratory, The Johns Hopkins University 


Materials and Procedure 


Twelve triangles (altitude to base proportions 
ranging from 0.25" × 1" to 3" × 1" by 
0.25" altitude steps) were combined in all 
possible combinations of two (66 pairs in all) 
and mimeographed on separate pages. The 
order of each pair of triangles on the pages, as 
well as the pages themselves in the booklets, 
was randomized in order to eliminate preference 
for a triangle merely because it was presented 
and chosen on the previous page, and 
preferences by left or right hand position. 

In the experimental situation, the subjects 
were instructed to check the triangle which 
they preferred on each page, not to omit any 
pages, and to work rapidly. The method of 
paired comparisons was used specifically to 
eliminate the effects of central tendency of 
judgment. 
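
The construction of the booklet described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_booklet` and the seed value are hypothetical, and a plain shuffle is assumed for both randomizations, since the exact procedure is not specified. A plain shuffle does not by itself prevent the same triangle from appearing on consecutive pages.

```python
import itertools
import random

# Altitudes in inches, 0.25 to 3.00 in 0.25 steps, each on a 1-inch base.
altitudes = [0.25 * k for k in range(1, 13)]


def build_booklet(seed=None):
    """Return the shuffled pages; each page is a (left, right) pair of altitudes."""
    rng = random.Random(seed)
    pages = []
    # Every unordered pair of the twelve triangles appears exactly once.
    for first, second in itertools.combinations(altitudes, 2):
        pair = [first, second]
        rng.shuffle(pair)      # randomize left or right hand position
        pages.append(tuple(pair))
    rng.shuffle(pages)         # randomize the order of pages in the booklet
    return pages


booklet = build_booklet(seed=1951)  # seed chosen arbitrarily for repeatability
print(len(booklet), "pages")
```

Each triangle appears on 11 of the 66 pages, so a single subject can choose any one triangle at most 11 times in one session.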

A retest was given to the same subjects on 
the same material exactly one week later. 
Fifty-two undergraduate university students 
served as subjects. 


Results and Discussion 


In Figure 1, the triangles are plotted as a 
function of preferences. The preferences are 
recorded in terms of the per cent of times each 
triangle was chosen. 

Generally, the preferences fall in a modal 
range including triangles 1" × 1" through 2" 
× 1" (altitude to base). The triangles in this 
range are significantly more preferred than the 
remaining triangles at beyond the one per cent 
level of confidence, as determined by the criti- 
cal ratio of the difference between proportions. 
These proportions were based on the number 
of times each triangle was chosen out of the 
total number of times that triangle could have 
been chosen. 
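
For readers unfamiliar with the term, a critical ratio of the difference between two proportions can be sketched as below. The pooled standard error is an assumption, since the exact formula used is not stated here, and the counts in the usage lines are hypothetical placeholders, not data from this study. The function name `critical_ratio` is likewise illustrative.

```python
import math


def critical_ratio(chosen_1, possible_1, chosen_2, possible_2):
    """Difference between two proportions divided by its standard error.

    Uses the pooled-proportion standard error (an assumption; an unpooled
    form is also in common use).
    """
    p1 = chosen_1 / possible_1
    p2 = chosen_2 / possible_2
    pooled = (chosen_1 + chosen_2) / (possible_1 + possible_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / possible_1 + 1 / possible_2))
    return (p1 - p2) / se


# Hypothetical counts only. With 52 subjects and 11 appearances per triangle,
# a triangle could be chosen 52 * 11 = 572 times in one session.
possible = 52 * 11
cr = critical_ratio(400, possible, 200, possible)
print(round(cr, 2), "exceeds 2.58:", cr > 2.58)
```

A critical ratio above about 2.58 corresponds to the one per cent level on a two-tailed test. Choices within a paired-comparison booklet are not fully independent, so the result is best read as an approximation.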

[Figure 1. Axis label: altitude to base ratio of triangles.] 

Fig. 1. Ratio of altitude to base (in inches) of 
triangles plotted as a function of preference for 52 
subjects. Preferences are recorded as per cent of times 
each triangle was chosen. Modal triangles are illustrated 
at the top of the figure. 

The test-retest situation enabled determination 
of a measure of reliability between the 
first and second tests. Test-retest Rho’s were 
computed for each subject. Although the 
range of the Rho’s was quite large, the median 
Rho for all 52 subjects was 0.85, which indicates 
considerable consistency in preference. 
Most of the Rho’s represent high positive correlations; 
however, some extreme low correlations 
were a result of a few subjects’ randomly 
checking triangles without trying to make 
preferences. 
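
A per-subject test-retest Rho of the kind described above can be sketched as a rank correlation between the number of times each triangle was chosen in the two sessions. Scoring each triangle by its choice count is an assumption, as the text does not say what was ranked; the two score lists are hypothetical, and the names `average_ranks` and `spearman_rho` are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if either list is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical choice counts for one subject over the twelve triangles
# (each list sums to 66, one choice per page).
test_scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
retest_scores = [1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 11, 10]
print(spearman_rho(test_scores, retest_scores))
```

Repeating this for every subject and taking the median of the resulting values would give a summary figure comparable in kind to the one reported.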

The authors realize that any situation such 
as this incorporates forced judgments that may 
be highly artificial. The fact remains, how- 
ever, that when such judgments are made, 
they fall into a definite pattern of choice. 

The range of preferred figures lies in the cen- 
ter of the total range of stimulus figures, but 
this is not, we feel, due to the effect of central 
tendency. The method of paired comparisons 
should minimize this effect by exposing only 
two figures at a time in a random sequence. 

An interesting aspect of this experiment is 
an analysis of the best liked triangles. Few 
of the subjects chose triangles within the modal 
range as most preferred. Most best liked triangles 
were at the extremes. Significant, 
however, is the fact that while many subjects 
preferred the triangles at one extreme, many 
other subjects had a definite dislike for these 
same triangles and made first choices at the 
other extreme. When the combined data were 
considered, these two tendencies cancelled each 
other out leaving extreme item preference 
scores low. In no case were triangles within 
the modal range disliked by any of the subjects. 

In the light of this analysis, it might be more 
correct to say that the triangles in the modal 
range were not liked the best but disliked the 
least. Herein lies a problem that manufac- 
turers, designers, and others might wish to 
consider carefully. Should they offer the 
public a “whole range” of items from which 
everyone can select a best liked article; should 
they offer “extreme” items to please some and 
displease others; or should they confine them- 
selves to the “modal” items, which, although 
they may not be the best liked articles, are the 
least disliked ones? 


Conclusion 


When isosceles triangles having constant 
bases and variable altitudes were judged for 
aesthetic preference, subjects chose with high 
consistency (median test-retest Rho = 0.85) 
triangles yielding ratios between 1 to 1 and 2 
to 1 (altitude to base). 


Received September 4, 1951. 